Finding a variable count given another variable in Python - python

Using Python 3
I am trying to pull the total count for each group:
1 - control and convert
2 - treatment and convert
control = df2[df2.group == 'control']
treatment = df2[df2.group == 'treatment']
old = df2[df2.landing_page == 'old']
new = df2[df2.landing_page == 'convert']
I've tried a couple different things:
control.user_id.count() + convert.user_id.count()
But this just adds both groups up.
I also tried a groupby but I can't get the syntax to work.
df2.groupby(df2[df2.group =='control',
'old']).landing_page().reset_index(name='Count')
What is the best way to pull a group given the presence in another group?

Are you looking for something like this?
Two lists:
a = [(1,2.1232),(3,5)]
b = [(1,2.1232),(5,5)]
List comprehension for finding how many elements of b are also in a:
sum([x in a for x in b])
Returns: 1
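If the goal is a count per combination of group and outcome, the idea can be sketched in plain Python as below. The rows and the `converted` field are made up for illustration; they stand in for df2, whose real columns are not shown in the question.

```python
from collections import Counter

# Hypothetical rows standing in for df2; the `converted` field is an assumption.
rows = [
    {"user_id": 1, "group": "control", "converted": 1},
    {"user_id": 2, "group": "control", "converted": 0},
    {"user_id": 3, "group": "treatment", "converted": 1},
    {"user_id": 4, "group": "treatment", "converted": 1},
]

# Count rows per (group, converted) pair in a single pass.
counts = Counter((row["group"], row["converted"]) for row in rows)

control_converted = counts[("control", 1)]
treatment_converted = counts[("treatment", 1)]
print(control_converted, treatment_converted)
```

With pandas itself, the same idea would presumably be to combine both conditions in one boolean mask, for example df2[(df2.group == 'control') & (df2.converted == 1)], again assuming a `converted` column exists.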

Related

Extracting multiple data from a single list

I'm working on a text file that contains several kinds of information. I converted it into a list in Python, and now I'm trying to separate the different data into different lists. The data is laid out as follows:
CODE / DESCRIPTION / Unity / Value1 / Value2 / Value3 / Value4, and then it repeats. An example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach so far has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Looping to find a code, based on the code's structure, and using the code's index as the base for finding the remaining values.
Finding a code is easy: it's a distinct type of information among the other data.
For the remaining values I made a loop to find the next numeric value after a code. That way I can delimit the rest of the indexes:
The unity would be at the code's index + index until isnumeric - 1, since it's the first piece of information before the first numeric value in each line.
The cost would be at the code's index + index until isnumeric + 2; the third value is the only one I need to store.
The description is a little harder, since the number of elements that compose it varies across the list. So I used slicing, starting at the code's index + 1 and ending at index until isnumeric - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing some problems with this approach. Although there will always be more elements in the list after a code, I'm getting the error:
while not txtl[i+j].isnumeric():
txtl[i+j] list index out of range.
I'll accept any solution that debugs my code, or even new solutions to the problem.
OBS: I'm also going to have to do this with a very similar data source, but there the code is just a sequence of 7 digits, and thus harder to find among the other data. Any solution that covers this case is also appreciated!
A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
    j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
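A small self-contained sketch of that short-circuit behaviour, using a made-up token list with no numeric entry at all. It also shows a likely reason the original loop ran off the end of the list: a value such as '203.02' is not considered numeric by str.isnumeric().

```python
# Hypothetical tokens: none of them passes str.isnumeric().
tokens = ["P03133", "Auxiliar", "helper", "un", "203.02"]

j = 0
# `and` stops at the first false operand, so tokens[j] is never
# evaluated once j has reached len(tokens).
while j < len(tokens) and not tokens[j].isnumeric():
    j += 1

found = j < len(tokens)          # False: the loop hit the end of the list
print(j, found)                  # 5 False
print("203.02".isnumeric())      # False, because of the decimal point
```

Checking `found` afterwards tells you whether the loop stopped on a numeric token or simply ran out of data.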
Also, please use a list of dict items instead of 4 different lists, for example:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint # just for pretty printing
textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []
def handle_line(textl: str):
    description = ''
    unity = None
    values = []
    for word in textl.split()[1:]:
        # it splits on space characters by default
        # you can ignore the first item in the list, as this will always be the code
        # str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
        if not word.replace(',', '').replace('.', '').isnumeric():
            if len(description) == 0:
                description = word
            else:
                description = f'{description} {word}' # I like f-strings
        elif not unity:
            # if unity is still None, that means it has not been set yet
            unity = word
        else:
            values.append(word)
    return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}
the_list.append(handle_line(textl))
pprint(the_list)
str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
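A possible variant, sketched under the assumption that every record follows the layout from the question (CODE / DESCRIPTION / Unity / values), so the unit is the last non-numeric word before the first number. The names `parse_line`, `NUMBER` and `code_pattern` are made up for this example; the `code_pattern` argument is one way the 7-digit codes of the second data source might be handled.

```python
import re

# Assumption: a numeric token looks like 203.02 or 1,203.02.
NUMBER = re.compile(r"^\d[\d.,]*$")

def parse_line(line, code_pattern=r"^P\d+$"):
    """Split one record into code, description, unit and values."""
    tokens = line.split()
    if not tokens or not re.match(code_pattern, tokens[0]):
        return None  # the line does not start with a code
    # Index of the first numeric token after the code (or the end of the line).
    first_num = next(
        (k for k in range(1, len(tokens)) if NUMBER.match(tokens[k])),
        len(tokens),
    )
    text = tokens[1:first_num]
    return {
        "code": tokens[0],
        "description": " ".join(text[:-1]),
        "unity": text[-1] if text else None,
        "values": tokens[first_num:],
    }

record = parse_line("P03133 Auxiliar helper un 203.02 417.54 437.22 675.80")
print(record)
```

For the second source, something like parse_line(line, code_pattern=r"^\d{7}$") might work. A description that itself contains a number would be cut short by this sketch, so treat it as a starting point only.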

How to make my code more efficient so it has a runtime of less than 1sec

I had a test recently (it's over now) and I had this question:
question
and I could do the question. However there was one test case at the end that said that we get an extra 10 marks if the runtime of that case is <1s. However, my issue was that I could not get the runtime to be below 1sec. My code and the test case are below. Also I should add we're not allowed to import any packages.
Test case:
longList = [i for i in range(100000)]
print(calculateAreas(longList,longList[::-1]))
My code:
def calculateAreas(w_list,h_list):
    width_length = len(w_list)
    height_length = len(h_list)
    list_of_areas = []
    red_area = 0
    white_area = 0
    yellow_area = 0
    for i in range(width_length):
        for n in range(height_length):
            if (i+n-2) % 3 == 0:
                red_area += (w_list[i] * h_list[n])
            if (i+n-2) % 3 == 1:
                white_area += (w_list[i] * h_list[n])
            if (i+n-2) % 3 == 2:
                yellow_area += (w_list[i] * h_list[n])
    list_of_areas.insert(0, white_area)
    list_of_areas.insert(1, yellow_area)
    list_of_areas.insert(2, red_area)
    tuple_area = tuple(list_of_areas)
    return tuple_area
One way to speed this up is to use slicing and do only the necessary operations. Both axes follow the repeating pattern white -> yellow -> red -> white..., so we can use slices to separate the colors and then find the total area of each one.
To pick out every third element along an axis we can use:
list[::3]
So if we sum w_list[0::3] and multiply it by the sum of h_list[0::3], we get about 1/3 of white's total value. We can then repeat this for w_list[1::3] with h_list[2::3] and for w_list[2::3] with h_list[1::3]. In short, what I'm trying to do is separate white into 3 grids like this one
W--W--W-
+--+--+-
+--+--+-
W--W--W-
+--+--+-
each slightly offset from the others, and using slices to isolate each grid and get its area without nested for loops:
def calculateAreas(w_list, h_list):
    white = 0
    yellow = 0
    red = 0
    for i, j in zip([0,1,2], [0,2,1]):
        white += sum(w_list[i::3]) * sum(h_list[j::3])
        yellow += sum(w_list[i::3]) * sum(h_list[(j+1)%3::3])
        red += sum(w_list[i::3]) * sum(h_list[(j+2)%3::3])
    return (white, yellow, red)
This way we only go through the for loop 3 times, and on a list of 100000 elements it takes 0.0904s on my slow laptop. If I can give you a couple of tips about your code: 1- try to iterate directly over the list elements (use enumerate) 2- use 'elif' and 'else' statements (if a cell is already 'white', you don't need to check whether it's red). And in general, if you need to speed up code, try to avoid nested loops: pairing every element of one list with every element of another takes len(list)**2 steps!
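A sketch of the same idea that computes each of the six slice sums only once, checked against a brute-force version that follows the colour rule from the question's code. The function names `calculate_areas_fast` and `calculate_areas_slow` are made up for this example.

```python
def calculate_areas_fast(w_list, h_list):
    # Sum each list once per residue class of the index modulo 3.
    w = [sum(w_list[r::3]) for r in range(3)]
    h = [sum(h_list[r::3]) for r in range(3)]
    totals = [0, 0, 0]  # white, yellow, red
    for i in range(3):
        for n in range(3):
            # (i + n) % 3 == 0 is white, 1 is yellow, 2 is red.
            totals[(i + n) % 3] += w[i] * h[n]
    return tuple(totals)

def calculate_areas_slow(w_list, h_list):
    # Brute-force reference using the colour rule from the question.
    white = yellow = red = 0
    for i, width in enumerate(w_list):
        for n, height in enumerate(h_list):
            remainder = (i + n - 2) % 3
            if remainder == 1:
                white += width * height
            elif remainder == 2:
                yellow += width * height
            else:
                red += width * height
    return (white, yellow, red)

small = list(range(1, 8))
print(calculate_areas_fast(small, small[::-1]) == calculate_areas_slow(small, small[::-1]))
```

The fast version touches each element a constant number of times, so its running time grows linearly with the list length instead of quadratically.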

Can you use if statements to create variables?

I am trying to make the switch from Stata to Python for data analysis and I'm running into some hiccups that I'd like some help with. I am attempting to create a secondary variable based on the values in an original variable. I want to create a binary variable that identifies fall accidents (E-codes E880.xx-E888.xx) with a value of 1 and all other E-codes with a value of 0. The list of ICD-9 codes has over 10,000 rows, so assigning the values manually isn't possible.
In Stata the code would look something like this:
newvar= 0
replace newvar = 1 if ecode_variable == "E880"
replace newvar = 1 if ecode_variable == "E881"
etc
I tried a similar statement in Python, but it's not working:
data['ecode_fall'] = 1 if data['ecode'] == 'E880'
Is this type of work possible in Python? Is there a function in the numpy or pandas packages that could help with this?
I've also tried creating a dictionary that maps the fall injury codes to 1 and applying it to the variable, to no avail.
Put the if first.
if data['ecode'] == 'E880': data['ecode_fall'] = 1
You can break it out into two lines like this:
if data['ecode'] == 'E880':
    data['ecode_fall'] = 1
Or, if you include an else clause, you can have it on one line, with syntax similar to your Stata code:
data['ecode_fall'] = 1 if data['ecode'] == 'E880' else 0
Following on from the other answers, you can also check multiple values at once like so:
if data['ecode'] in ('E880', 'E881', ...):
    data['ecode_fall'] = 1
This leaves you needing only one if statement per unique value of data['ecode_fall'].
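When the codes form a whole column rather than a single value, the test has to be applied to each element. A minimal plain-Python sketch, assuming the codes are strings such as 'E885.9'; the sample list and the names `FALL_PREFIXES` and `is_fall` are made up for illustration.

```python
# The nine fall-accident prefixes E880 through E888.
FALL_PREFIXES = {f"E88{digit}" for digit in range(9)}

def is_fall(code):
    """Return 1 for E880.xx-E888.xx codes, 0 for anything else."""
    if not isinstance(code, str):
        return 0  # missing values are treated as non-falls here
    return 1 if code[:4] in FALL_PREFIXES else 0

# Hypothetical sample of E-codes standing in for the real column.
ecodes = ["E880.1", "E885.9", "E812.0", "E888", "E9220", None]
ecode_fall = [is_fall(code) for code in ecodes]
print(ecode_fall)  # [1, 1, 0, 1, 0, 0]
```

If the data lives in a pandas DataFrame, the same function could presumably be applied to the column with data['ecode'].apply(is_fall).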

Python: replace for loop with function

Can anyone help me understand how I would create a function with def whatever() instead of using a for loop? I'm trying to do things more Pythonically but don't really understand how to apply a function well instead of a loop. For instance, the loop below works well and gives the output I would like. Is there a way to do this with a function?
seasons = leaguesFinal['season'].unique()
teams = teamsDF['team_long_name'].unique()
df = []
for i in seasons:
    season = leaguesFinal['season'] == i
    season = leaguesFinal[season]
    for j in teams:
        team_season_wins = season['win'] == j
        team_season_win_record = team_season_wins[team_season_wins].count()
        team_season_loss = season['loss'] == j
        team_season_loss_record = team_season_loss[team_season_loss].count()
        df.append((j, i, team_season_win_record, team_season_loss_record))
df = pd.DataFrame(df, columns=('Team', 'Seasons', 'Wins', 'Losses'))
The output looks as follows:
                Team    Seasons  Wins  Losses
0           KRC Genk  2008/2009    15      14
1       Beerschot AC  2008/2009    11      14
2   SV Zulte-Waregem  2008/2009    16      11
3   Sporting Lokeren  2008/2009    13       9
4  KSV Cercle Brugge  2008/2009    14      15
Solution
def some_loop(something, something_else):
    for i in something:
        season = leaguesFinal['season'] == i
        season = leaguesFinal[season]
        for j in something_else:
            team_season_wins = season['win'] == j
            team_season_win_record = team_season_wins[team_season_wins].count()
            team_season_loss = season['loss'] == j
            team_season_loss_record = team_season_loss[team_season_loss].count()
            df.append((j, i, team_season_win_record, team_season_loss_record))
some_loop(seasons, teams)
Comments
This is what you are describing: creating a function out of the for loop. Although you still have a for loop, it's now in a function that you can use in different areas of your code without repeating the entire loop.
All there is to do is define a function that accepts two variables for this particular loop, which would be def some_loop(something, something_else). I used basic names so you can see more clearly what's taking place.
Then you would replace all the instances of seasons and teams with those variables.
Now when you call your function, it will replace all occurrences of something and something_else with whatever inputs you send to it.
Also, I am not completely sure about the statements of the form x = y == i and what they accomplish, or whether they are even valid statements.
Actually, you're mixing things up: functions just group lines of code and thus make them reusable without writing everything again, whereas for loops are for iteration.
In your example above, a function would just contain the for loop and return the resulting dataframe, which you could then use. But it will not change anything or make your code smarter.
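A minimal sketch of a function that contains the loop and returns its result instead of appending to a global list. To keep it self-contained it works on a plain list of dicts rather than a DataFrame; the function name `season_records` and the sample matches are invented for illustration.

```python
def season_records(matches, seasons, teams):
    """Return (team, season, wins, losses) tuples for every season/team pair."""
    rows = []
    for season in seasons:
        in_season = [m for m in matches if m["season"] == season]
        for team in teams:
            wins = sum(1 for m in in_season if m["win"] == team)
            losses = sum(1 for m in in_season if m["loss"] == team)
            rows.append((team, season, wins, losses))
    return rows

# Hypothetical match results, not taken from the question's data.
matches = [
    {"season": "2008/2009", "win": "Team A", "loss": "Team B"},
    {"season": "2008/2009", "win": "Team A", "loss": "Team B"},
    {"season": "2008/2009", "win": "Team B", "loss": "Team A"},
    {"season": "2009/2010", "win": "Team B", "loss": "Team A"},
]

rows = season_records(matches, ["2008/2009", "2009/2010"], ["Team A", "Team B"])
for row in rows:
    print(row)
```

Because the function returns its rows, the caller decides what to do with them, for example passing them to pd.DataFrame with the column names used in the question.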

make a global condition break

allow me to preface this by saying that i am learning python on my own as part of my own curiosity, and i was recommended a free online computer science course that is publicly available, so i apologize if i am using terms incorrectly.
i have seen questions regarding this particular problem on here before - but i have a separate question from them and did not want to hijack those threads. the question:
"a substring is any consecutive sequence of characters inside another string. The same substring may occur several times inside the same string: for example "assesses" has the substring "sses" 2 times, and "trans-Panamanian banana" has the substring "an" 6 times. Write a program that takes two lines of input, we call the first needle and the second haystack. Print the number of times that needle occurs as a substring of haystack."
my solution (which works) is:
first = str(input())
second = str(input())
count = 0
location = 0
while location < len(second):
    if location == 0:
        location = str.find(second,first,0)
        if location < 0:
            break
        count = count + 1
    location = str.find(second,first,location +1)
    if location < 0:
        break
    count = count + 1
print(count)
if you notice, i have on two separate occasions made the if statement that if location is less than 0, to break. is there some way to make this a 'global' condition so i do not have repetitive code? i imagine efficiency becomes paramount with increasing program sophistication so i am trying to develop good practice now.
how would python gurus optimize this code or am i just being too nitpicky?
I think Matthew and darshan have the best solution. I will just post a variation which is based on your solution:
first = str(input())
second = str(input())
def count_needle(first, second):
    location = str.find(second,first)
    if location == -1:
        return 0 # none whatsoever
    else:
        count = 1
        while location < len(second):
            location = str.find(second,first,location +1)
            if location < 0:
                break
            count = count + 1
        return count
print(count_needle(first, second))
Idea:
use a function to structure the code when appropriate
initialising the variable location before entering the while loop saves you from checking location < 0 multiple times
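Taking that idea one step further, the "not found" test can become the loop condition itself, so it is written only once. A minimal sketch, assuming a non-empty needle; the function name `count_overlapping` is made up for this example.

```python
def count_overlapping(needle, haystack):
    """Count occurrences of needle in haystack, including overlapping ones."""
    count = 0
    location = haystack.find(needle)
    # str.find returns -1 when there is no further match, which ends the loop.
    while location != -1:
        count += 1
        # Resume one character later so overlapping matches are found too.
        location = haystack.find(needle, location + 1)
    return count

print(count_overlapping("sses", "assesses"))               # 2
print(count_overlapping("an", "trans-Panamanian banana"))  # 6
```

The two calls reproduce the examples from the exercise text.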
Check out regular expressions, python's re module (http://docs.python.org/library/re.html). For example,
import re
first = str(input())
second = str(input())
regex = '(?=' + re.escape(first) + ')'  # lookahead, so overlapping matches are counted too
print(len(re.findall(regex, second)))
As mentioned by Matthew Adams, the best way to do it is to use Python's re module.
For your case the solution would look something like this:
import re
def find_needle_in_haystack(needle, haystack):
    # the lookahead lets overlapping occurrences be counted; re.escape treats the needle literally
    return len(re.findall('(?=' + re.escape(needle) + ')', haystack))
Since you are learning Python, the best way would be to follow the 'DRY' (Don't Repeat Yourself) mantra. There are lots of Python utilities that you can use in many similar situations.
For a quick overview of a few very important Python modules you can go through this class:
Google Python Class
which should only take you a day.
Even your approach could, IMO, be simplified (this uses the fact that find returns -1 when you ask it to search from a nonexistent offset):
>>> x = 'xoxoxo'
>>> start = x.find('o')
>>> indexes = []
>>> while start > -1:
...     indexes.append(start)
...     start = x.find('o',start+1)
>>> indexes
[1, 3, 5]
needle = "ss"
haystack = "ssi lass 2 vecess estan ss."
print('needle occurs %d times in haystack.' % haystack.count(needle))
Here you go :
first = str(input())
second = str(input())
x=len(first)
counter=0
for i in range(0,len(second)):
    if first==second[i:(x+i)]:
        counter=counter+1
print(counter)
Answer
needle=input()
haystack=input()
counter=0
for i in range(0,len(haystack)):
    if(haystack[i:len(needle)+i]!=needle):
        continue
    counter=counter+1
print(counter)
