Most efficient way to access Pandas dataframes? - python

I have a dataframe in Python that I want to trace through in a very specific way and I'm very new to using Pandas so I need some advice on how best to do this. This dataframe has information on many, many video games released over the course of history. Each row is an entry for a particular video game and each column contains info such as game names, release years, sales numbers, and console platforms (the same game appears multiple times if released on multiple platforms).
I want to do some calculations on sales figures based on release consoles over particular dates. The most obvious way of doing this is, of course, manually looping over every row in the dataframe checking to see if entries match my particular requirements for a calculation.
This is how I plan to do my traversals:
for s in frame.iterrows():
    if s[1][1] == "Wii":
        print(s[1][1])  # as a test, I can print out entries for Wii games
My question is whether this is the "correct" or most efficient way to do this, which I assume it's not. Pandas seems to have a TON of useful methods for dataframes, and I would like to know if it has a more efficient way of looking up only the data that meets certain criteria.

Assuming you want Wii games, an easy way to do this is the following. Let's take a toy dataframe as an example:
# DataFrame 'games':
  console       title
0    Xbox        Halo
1     Wii  Smash Bros
To get all the rows with Wii games, you can run:
games[games["console"] == "Wii"]
# returns:
  console       title
1     Wii  Smash Bros
Hope this helps! Let me know if you have any follow up questions/want more detail
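For completeness, here is the toy example built and filtered end to end. The `year` and `sales` columns are hypothetical additions, just to show how boolean masks combine for the date-based calculations mentioned in the question:

```python
import pandas as pd

games = pd.DataFrame({
    "console": ["Xbox", "Wii"],
    "title": ["Halo", "Smash Bros"],
    "year": [2001, 2008],      # hypothetical columns, added only to
    "sales": [5.0, 13.3],      # illustrate combined conditions
})

# Boolean masking instead of iterrows(): vectorised and much faster.
wii = games[games["console"] == "Wii"]

# Conditions combine with & / | (note the parentheses around each one).
wii_since_2005 = games[(games["console"] == "Wii") & (games["year"] >= 2005)]

# Aggregations work directly on the filtered frame.
total_sales = wii_since_2005["sales"].sum()
```

The mask approach also scales: filtering runs in vectorised C code, while an `iterrows()` loop pays Python-level overhead for every single row.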

Related

Handling Multiple if/else and Special Cases

So I'm fairly new to coding, having only written relatively simple scripts here and there when I need them for work. I have a document that has an ID column formatted as:
"Number Word Number" and some values under a spec, lower tol, and upper tol column.
Sometimes the number under ID is an integer or a float, and the word can be one of, say, 30 different possibilities. Ultimately these need to be read and then organized, depending on the spec and lower/upper tol columns, into something like below:
I'm using Pandas to read the data and do the manipulations I need so my question isn't so much of a how to do it, but more of a how should it best be done.
The way my code is written is basically a series of if statements that handle each of the scenarios I've come across so far, but based on other people's code I've seen, this is generally not done and, as I understand it, is considered poor practice. They're very basic if statements like:
if (the ID column has "Note" in it) then it's a basic dimension
if (the ID column has "Roughness" in it) then it's an Ra value
if (the ID column has "Position" in it) then it's a true position, etc.
The problem is that I'm not really sure what the "correct" way to do it would be, in terms of making it more efficient and simpler. I currently have a series of 30+ if statements and ways to handle the different situations I've run into so far. Virtually all the code I've written follows this overly specific, not very general methodology. Even though it works, I personally find it overcomplicated, but I'm not sure which capabilities of Python/Pandas I'm missing and failing to utilize to simplify my code.
Since you need to test what the value in ID is and do some stuff accordingly, you most probably can't avoid the if statements. What I suggest, since you have already written the code, is to reform the database. Unless there is a very specific reason you have a database with a structure like this, you should change it as soon as possible.
To be specific: give ID an (auto-)incrementing unique number, and break the 3 data points of the ID column into 3 separate columns.
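One common way to collapse a long chain of keyword checks is to move the keyword-to-category pairs into a dict and loop over it, so adding case 31 means adding one line of data rather than another if statement. This is a sketch; the keywords and labels come from the question, while the column and function names are hypothetical:

```python
import pandas as pd

# Keyword found in the ID column -> category label.
KEYWORD_LABELS = {
    "Note": "basic dimension",
    "Roughness": "Ra value",
    "Position": "true position",
    # ... the remaining cases go here, as data instead of code
}

def categorize(id_value):
    """Return the label of the first keyword found in the ID string."""
    for keyword, label in KEYWORD_LABELS.items():
        if keyword in id_value:
            return label
    return "unknown"

df = pd.DataFrame({"ID": ["1 Note 2", "3.5 Roughness 1", "2 Position 4"]})
df["category"] = df["ID"].map(categorize)

# Splitting the "Number Word Number" ID into three separate columns,
# as the answer suggests:
df[["num1", "word", "num2"]] = df["ID"].str.split(" ", expand=True)
```

If the handling logic differs per case (not just the label), the same idea works with a dict mapping keywords to handler functions.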

About Data Cleaning

I am a pretty amateur data science student, and I am working on a project where I compare two servers in a team-based game, but my two datasets are formatted differently from one another. One example column would be first blood: one dataset stores this information as "blue_team_first_blood" with True or False values, whereas the other stores it as just "first blood" with integers (1 for blue team, 2 for red team, 0 for no one, if applicable).
I feel like I can code around these differences, but what's the best practice? Should I take the extra step to make sure both datasets are formatted consistently, or does it matter at all?
Data cleaning is usually the first step in any data science project. It makes sense to transform the data into a consistent format before any further processing steps.
You could consider transforming the "blue_team_first_blood" column to an integer format that is consistent with the other dataset, such as 1 for True and 0 for False. You could also consider renaming the "first blood" column in the second dataset to "blue_team_first_blood" to match the first dataset.
Overall, the best practice is to ensure that both datasets are formatted consistently and in a way that makes sense for your analysis. This will make it easier to compare the two datasets and draw meaningful insights.
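As a concrete sketch of harmonising the two formats (the frame and column names here are assumptions based on the description above):

```python
import pandas as pd

# Hypothetical frames mirroring the two servers' formats.
server_a = pd.DataFrame({"blue_team_first_blood": [True, False]})
server_b = pd.DataFrame({"first blood": [1, 2, 0]})

# Derive the same boolean column in dataset B: blue team took first
# blood exactly when the integer code is 1.
server_b["blue_team_first_blood"] = server_b["first blood"] == 1

# Note the information mismatch: a True/False column cannot distinguish
# "red team first blood" (2) from "no first blood" (0), so converting
# the integer column down to boolean loses no information that dataset A
# had, while converting A up to the 1/2/0 encoding would require data
# that isn't there.
```

That asymmetry is worth checking before picking a common format: always convert toward the representation that both datasets can actually express.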

How to automate calculations in Pandas dataframe

I'm currently struggling to find good information on how to calculate differences, percentages etc. using several columns and rows in a Pandas dataframe - and how to show the output in a nice table using Python.
Short example of what I'm going for:
I'm working with NBA data and have gathered a bunch of match statistics for home and away teams during the 2019/20 season (the season finishes later this month). The first row shows the Free Throw percentage and "Regular" means regular matches with audience members and "Bubble" denotes the matches without audience members.
A short view of my Pandas dataframe:
How do I automate the calculations using Python code? Feel free to give me examples!
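The dataframe itself isn't shown in the question, so the layout below is a guess: one row per statistic, with "Regular" and "Bubble" columns holding percentages. The point is that column arithmetic in pandas is vectorised, so each derived statistic is one line:

```python
import pandas as pd

# Hypothetical layout: one row per statistic, columns for the two
# match conditions described in the question.
stats = pd.DataFrame(
    {"Regular": [77.8, 45.9], "Bubble": [79.1, 46.5]},
    index=["FT%", "FG%"],
)

# "Automating" the calculations is one line per derived column:
stats["Diff"] = stats["Bubble"] - stats["Regular"]
stats["Pct change"] = stats["Diff"] / stats["Regular"] * 100

print(stats.round(2))  # a plain print already gives a readable table
```

For nicer output, `DataFrame.to_string()` or `DataFrame.style` (in a notebook) can format the table further.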

How can I use machine learning to convert orientation sensor data into movement events?

I built a device based on a microcontroller with some sensors attached to it, one of them is an orientation sensor that currently delivers information about pitch, yaw, roll and acceleration for x,y,z. I would like to be able to detect movement "events" when the device is well... moved around.
For example I would like to detect a "repositioned" event which basically would consist of series of other events - "up" (picked up), "move" (moved in air to some other point), "down" (device put back down).
Since I am just starting to figure out how to make it possible I would like to ask if I am getting the right ideas or wasting my time.
My current idea is to use the data I probed to create a dataset and try to use machine learning to detect whether each element belongs to one of the events I am trying to detect. So basically I took the device and first rotated it on the table a few times, then picked it up several times, then moved it in the air, and finally put it down several times. This generated a set of data with a structure like this:
yaw,pitch,roll,accelx,accely,accelz,state
-140,178,178,17,-163,-495,stand
110,-176,-166,-212,-97,-389,down
118,-177,178,123,16,-146,up
166,-174,-171,-375,-145,-929,up
157,-178,178,4,-61,-259,down
108,177,-177,-55,76,-516,move
152,178,-179,35,98,-479,stand
175,177,-178,-30,-168,-668,move
100,177,178,-42,26,-447,stand
-14,177,179,42,-57,-491,stand
-155,177,179,28,-57,-469,stand
92,-173,-169,347,-373,-305,down
[...]
The last "state" column was added by me: I added it after each test movement type and then shuffled the rows.
I got about 450 records this way, and the idea is to use machine learning to predict the "state" column for each record coming from the running device. I could then queue up the outcomes, and if the "up" events are the majority over some short period, I can take it that the device is being picked up.
Maybe instead of using each reading as a data row, I should take the last 10 readings (let's say) and try to predict what happens per column. For example, if I know the last 10 yaw readings were the changes that occurred while I was moving the device up, I should use that data: 10 readings from each of the 6 columns are processed as one row, giving 6 results, and again the ratio of result types may make it possible to detect the "movement" event that happened during those 10 readings.
I am currently about 30% into an online ML course and enjoying it but I'd really like to hear some comments from more experienced people.
Are my ideas a reasonable solution or am I totally failing to understand how I can use ML? If so, what resources shall I use to get myself started?
Your idea to regroup the readings seems interesting, but it all depends on how often you get a record and how you plan to group them.
If you get a record every 10-100 ms, it could be a good idea to group them, since it will help you get more accurate data by reducing noise. You could take the mean of each column to get rid of that noise and help your classifier better distinguish your different states.
Otherwise, if you only get a record every second, I think it's a bad idea to regroup the records, since you will most certainly mix several actions together.
The best way would be to try out both approaches if you have the time ^^
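The windowing idea above can be sketched as follows. The window size of 10 is the asker's own suggestion; the function name and array shapes are illustrative, and the classifier itself (e.g. a decision tree or k-NN) is omitted, since any model that takes a fixed-length feature vector would fit here:

```python
import numpy as np

def window_features(readings, window=10):
    """Group consecutive sensor readings into windows and average each
    channel, reducing noise before classification.

    readings: (n_samples, n_channels) array
    returns:  (n_windows, n_channels) array of per-window means
    """
    n = (len(readings) // window) * window        # drop the ragged tail
    trimmed = readings[:n].reshape(-1, window, readings.shape[1])
    return trimmed.mean(axis=1)

# 10 fake readings of the 6 channels (yaw, pitch, roll, accelx/y/z):
data = np.arange(60, dtype=float).reshape(10, 6)
feats = window_features(data, window=5)           # 2 windows of 6 averaged channels
```

One caveat with this setup: the labels would have to be assigned per window rather than per reading, so the training rows should be built from *unshuffled* data, where consecutive readings still belong to the same movement.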

How are moving objects best represented/implemented in a pygame-based simulation?

I am currently attempting to make a Bee Simulation for college and I have started working out the basics of how to do it.
The initial idea was to use PyGame and present the user with bees on the screen but for now I am just doing the basic functions first.
The function I am having issues with is the one where a bee looks for cells that are not being used and then goes and uses them. It is run on every new frame for every bee object, so each bee checks each cell.
I'm using this code for this:
for i in range(0, len(hiveCells)):
    if hiveCells[i] == "":
        print("Not taken")
        hiveCells[i] = "B"
    else:
        print("Taken")
But the issue with this is that, of course, it finishes within seconds and the bees have used the whole hive. I need a way to do this slowly, including the time it takes to travel to a cell and then the time it takes to actually use it.
What is the best way to do this? I was thinking of using coordinates and it will move closer to those coordinates every loop and check if it has reached them.
In order to include travel time for each bee, you would first need to define some kind of distance measure. A trivial choice would be the Euclidean distance.
In order to incorporate this into your model, you would need the following additions:
Add a location (x, y), and possibly (z), to each bee and each hive cell.
Define how much time (in seconds) elapses per frame update.
Define the speed of the bee (in m/s).
Now per frame update you know how much time has elapsed since the last update, and you can (using the bee speed and location) compute the new location of the bee.
The update frequency of the frame is now directly related to the time that is elapsed in your model.
Note that in order for this to work you would need some type of ID which relates the bee to the hive cell it claimed. I would recommend giving each bee a unique ID.
Then as soon as the bee claims a hive cell you store the unique bee ID in the hive cell, such that at each frame update you can compute the new location for each bee with respect to the hive cell it is flying to.
Additionally, note that for this scheme to work each hive cell would need a location (which you could store in a similarly sized array). It might be cleanest to create an object for each hive cell, which stores its coordinates and the ID of the bee that claimed it. This would also let you further improve your model by adding extra information (honey present, or whatever) to the hive cells/bees.
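The per-frame update described above can be sketched like this. The class name, speed, and time step are illustrative assumptions, not part of the asker's code:

```python
import math

class Bee:
    def __init__(self, bee_id, x, y, speed=2.0):
        self.id = bee_id
        self.x, self.y = x, y
        self.speed = speed      # distance units per second (assumed)
        self.target = None      # (x, y) of the claimed hive cell

    def update(self, dt):
        """Advance the bee by speed*dt toward its target.

        dt is the seconds elapsed since the last frame.
        Returns True once the bee arrives at the cell.
        """
        if self.target is None:
            return False
        dx, dy = self.target[0] - self.x, self.target[1] - self.y
        dist = math.hypot(dx, dy)       # Euclidean distance to the cell
        step = self.speed * dt
        if dist <= step:                # would overshoot: snap to target
            self.x, self.y = self.target
            return True
        self.x += dx / dist * step      # move along the unit direction
        self.y += dy / dist * step
        return False

bee = Bee("b1", 0.0, 0.0, speed=2.0)
bee.target = (10.0, 0.0)
first = bee.update(1.0)                 # one frame, 1 s elapsed
```

Each frame, the main loop would call `update(dt)` on every bee; a bee that returns True has reached its claimed cell, at which point a "using the cell" timer could start.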
