About Data Cleaning - python

I am a pretty amateur data science student and I am working on a project where I compare two servers in a team-based game, but my two datasets are formatted differently from one another. Take first blood, for instance: one dataset stores this information as "blue_team_first_blood" with True or False values, whereas the other stores it as just "first blood" with integers (1 for blue team, 2 for red team, 0 for no one, if applicable).
I feel like I can code around these differences, but what's the best practice? Should I take the extra step to make sure both datasets are formatted the same way, or does it not matter at all?

Data cleaning is usually the first step in any data science project. It makes sense to transform the data into a consistent format before any further processing steps.

You could consider transforming the "blue_team_first_blood" column to an integer format that is consistent with the other dataset, such as 1 for True and 0 for False. You could also consider renaming the "first blood" column in the second dataset to "blue_team_first_blood" to match the first dataset.
Overall, the best practice is to ensure that both datasets are formatted consistently and in a way that makes sense for your analysis. This will make it easier to compare the two datasets and draw meaningful insights.
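For example, a minimal sketch of harmonizing both datasets on the integer convention (the frame names df_a and df_b are placeholders for the two servers' exports):
import pandas as pd

# hypothetical stand-ins for the two servers' exports
df_a = pd.DataFrame({"blue_team_first_blood": [True, False, True]})
df_b = pd.DataFrame({"first blood": [1, 2, 0]})

# pick one convention, e.g. 1 = blue team took first blood, 0 = otherwise
df_a["blue_team_first_blood"] = df_a["blue_team_first_blood"].astype(int)
df_b["blue_team_first_blood"] = (df_b["first blood"] == 1).astype(int)
df_b = df_b.drop(columns=["first blood"])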

Related

Handling Multiple if/else and Special Cases

So I'm fairly new to coding, having only written relatively simple scripts here and there when I need them for work. I have a document that has an ID column formatted as:
"Number Word Number" and some values under a spec, lower tol, and upper tol column.
Sometimes the number under ID is an integer or a float, and the word can be one of, say, 30 different possibilities. Ultimately these need to be read and then organized, depending on the spec and lower/upper tol columns, into something like below:
I'm using Pandas to read the data and do the manipulations I need so my question isn't so much of a how to do it, but more of a how should it best be done.
The way my code is written is basically a series of if statements that handle each of the scenarios I've come across so far, but based on other people's code I've seen, this is generally not done and, as I understand it, is considered poor practice. It's very basic if statements like:
if (the ID column has "Note" in it) then it's a basic dimension
if (the ID column has "Roughness" in it) then it's an Ra value
if (the ID column has "Position" in it) then it's a true position, etc.
The problem is I'm not really sure what the "correct" way to do it would be in terms of making it more efficient and simpler. I currently have a series of 30+ if statements and ways to handle the different situations I've run into so far. Virtually all the code I've written follows this overly specific, not very general approach; even though it works, I personally find it overcomplicated, but I'm not really sure what capabilities of Python/pandas I'm missing and not utilizing to simplify my code.
Since you need to test what the value in ID is and do some stuff accordingly, you most probably can't avoid the if statements entirely. What I suggest, since you have already written the code, is to reform the database. Unless there is a very specific reason for a structure like this, you should change it as soon as possible.
To be specific: give each row an (auto)incrementing unique number as its ID, and break the 3 data points of the current ID column into 3 separate columns.
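A rough sketch of that idea in pandas (the column and keyword names are assumptions based on the examples above); splitting the ID and using one lookup table replaces most of the if chain:
import pandas as pd

# hypothetical frame; the real ID strings follow "Number Word Number"
df = pd.DataFrame({"ID": ["12.5 Note 1", "3 Roughness 2", "7 Position 4"]})

# break the single ID column into three separate columns
df[["num1", "word", "num2"]] = df["ID"].str.split(" ", n=2, expand=True)

# one mapping instead of 30+ if statements
word_to_type = {
    "Note": "basic dimension",
    "Roughness": "Ra value",
    "Position": "true position",
    # ...one entry per keyword
}
df["type"] = df["word"].map(word_to_type).fillna("unhandled")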

Slow loop in Python to search data in another data frame

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in Python:
for i in range(0, 15557580):
    for j in range(0, 542):
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break
But since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?
Thank you very much (I am still new to this).
edit :
my file info looks like this (about 500 observations, one for each station)
my file data looks like this (there are other variables not shown here) (about 15 million observations, one for each travel)
and what I am looking to get is that when the station numbers match, the resulting data would look like this:
This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.
# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']
# calculate Boolean mask on year
mask = data['year'] == '2018'
# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
.fillna(data.loc[mask, 'latitude'])
data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
.fillna(data.loc[mask, 'longitude'])
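For completeness, a sketch of the pandas.merge route mentioned above (it assumes data and info have the columns used in the question):
# merge the station coordinates in, then overwrite only the 2018 rows
coords = info[['station', 'latitude', 'longitude']]
merged = data.merge(coords, on='station', how='left', suffixes=('', '_info'))
mask = merged['year'] == '2018'
for col in ['latitude', 'longitude']:
    merged.loc[mask, col] = merged.loc[mask, col + '_info']\
        .fillna(merged.loc[mask, col])
merged = merged.drop(columns=['latitude_info', 'longitude_info'])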
This is a very recurrent and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to its main concepts.
1. Prepare your dataset
In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of data, making them optimized for your further processing.
2. Optimize your script
Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to get the exact same result with as little computation as possible. Try to avoid any unnecessary computation.
You can also consider splitting the work over multiple threads if appropriate.
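For CPU-bound pandas work, processes usually help more than threads because of Python's GIL. A minimal sketch of chunked parallel processing (process_chunk and the input file name are placeholders):
from multiprocessing import Pool
import numpy as np
import pandas as pd

def process_chunk(chunk):
    # placeholder for the real per-chunk computation
    return chunk

if __name__ == '__main__':
    data = pd.read_csv('data.csv')      # assumed input file
    chunks = np.array_split(data, 8)    # one chunk per worker
    with Pool(processes=8) as pool:
        data = pd.concat(pool.map(process_chunk, chunks))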
As a general rule, you should not write for loops that you break out of from inside. Whenever you don't know precisely how many iterations you will need in the first place, use while or do...while loops instead.
3. Consider using distributed storage and computing
This is a subject in itself that is way too big to be all explained here.
Storing, accessing and processing data in a serialized way is fine for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks, which aim at doing everything in parallel. They rely on a concept named MapReduce.
The first distributed data storage framework was Hadoop (e.g. the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.
In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop, for example.
Again, this subject is a whole different world, but it might very well be suited to your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.
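As a rough illustration, a minimal PySpark sketch of the same station lookup, assuming the two tables are available as CSV files:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("station-lookup").getOrCreate()
data_df = spark.read.csv("data.csv", header=True)
info_df = spark.read.csv("info.csv", header=True)

# the join runs distributed instead of in a Python loop;
# keep or drop the year filter depending on what you need
joined = (data_df.filter(data_df.year == "2018")
          .join(info_df.select("station", "latitude", "longitude"),
                on="station", how="left"))
joined.write.parquet("data_with_coords.parquet")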

Making Dataframe Analysis faster

I am using three dataframes to analyze sequential numeric data - basically numeric data captured in time. There are 8 columns and 360k entries. I created three identical dataframes - one holds the raw data, the second is a "scratch pad" for analysis, and the third contains the analyzed outcome. This runs really slowly. I'm wondering if there are ways to make this analysis run faster? Would it be faster if, instead of three separate 8-column dataframes, I had one large 24-column dataframe?
Use cProfile and line_profiler to figure out where the time is being spent.
To get help from others, post your real code and your real profile results.
Optimization is an empirical process. The little tips people have are often counterproductive.
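For instance (run_analysis and the frame names are placeholders for your real entry point):
import cProfile
import pstats

# profile one full run and print the 20 biggest hot spots
cProfile.run("run_analysis(df_raw, df_scratch, df_result)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)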
Most probably it doesn't matter, because pandas stores each column separately anyway (a DataFrame is a collection of Series). But you might get better data locality (all data next to each other in memory) by using a single frame, so it's worth trying. Check this empirically.
Rereading this post, I realize I could have been clearer. I have been using write statements like:
dm.iloc[p,XCol] = dh.iloc[x,XCol]
to transfer individual cells of one dataframe (dh) to a different row of a second dataframe (dm). It ran very slowly, but I needed this specific file sorted and I just lived with the performance.
According to "Learning Pandas" by Michael Heydt (p. 146), ".iat" is faster than ".iloc" for extracting (or writing) scalar values from a dataframe. I tried it and it works: with my original 300k-row files, run time was 13 hours(!) using ".iloc"; the same datafile using ".iat" ran in about 5 minutes.
Net - this is faster:
dm.iat[p,XCol] = dh.iat[x,XCol]
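A quick, self-contained way to confirm the difference on your own machine (the frame here is random stand-in data):
import timeit
import numpy as np
import pandas as pd

dm = pd.DataFrame(np.random.rand(300000, 8))

# scalar access: .iat skips the indexing machinery that .iloc goes through
print(timeit.timeit(lambda: dm.iloc[1000, 3], number=100000))
print(timeit.timeit(lambda: dm.iat[1000, 3], number=100000))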

Transfer data from table to objects (Python)

I am new to programming and would appreciate any advice regarding my assignment. And before you read what follows, please accept my apologies for being so silly.
Context
Every week I receive several .txt documents. Each document goes something like this:
№;Date;PatientID;Checkpoint;Parameter1;Parameter2;Parameter3/n
1;01.02.2014;0001;1;25.3;24.2;40.0/n
2;01.02.2014;0002;1;22.1;19.1;40.7/n
3;02.02.2014;0003;1;20.1;19.3;44.2/n
4;04.02.2014;0001;2;22.8;16.1;39.3/n
...
The first line contains column names, and every other line represents an observation. In fact there are over 200 columns and about 3000 lines in each .txt file I get. Moreover, every week column names may be slightly different from what they were a week before, and every week the number of observations increases.
My job is to select the observations that satisfy certain parameter requirements and build boxplots for some of the parameters.
What I think I should do
I want to make a program in Python 2.7.6 that would consist of four parts.
Code that would turn every observation into an object, so that I can access attributes like this:
obs1.checkpoint = 1
obs4.patientid = "0001"
I literally want the column names to become attribute names.
Having done this, it would be nice to create an object for every unique PatientID, and I would like the objects representing observations related to a patient to be attributes of that patient object. My goal here is to make it easy to check whether a patient's parameter increases from checkpoint 1 to checkpoint 2.
Code that would select the observations I need.
Code that would build boxplots.
Code that would combine the three parts above into one program.
What I've found so far
I have found some working code that dynamically adds attributes to instances:
http://znasibov.info/blog/html/2010/03/10/python-classes-dynamic-properties.html
I'm afraid I don't fully understand how it works yet, but I think it might be of use in my case to turn column names into attribute names.
I have also found that creating variables dynamically is frowned upon:
http://nedbatchelder.com/blog/201112/keep_data_out_of_your_variable_names.html
Questions
Is it a good idea to turn every line in the table into an object?
If so, how do I go about it?
How do I create as many objects as there are lines in the table, and how do I name these objects?
What is the best way to turn column names into class attributes?
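One possible shape for parts 1 and 2, sketched with current pandas (the file name, separator handling and lower-casing of attribute names are assumptions); keeping the data in a DataFrame and using groupby is often simpler, but this mirrors the object layout described above:
import pandas as pd

class Observation(object):
    """One row of the table; column names become attribute names."""
    def __init__(self, row):
        for name in row.index:
            setattr(self, name.lower(), row[name])

df = pd.read_csv("week.txt", sep=";", dtype={"PatientID": str})
observations = [Observation(row) for _, row in df.iterrows()]

# group the observations per patient
patients = {}
for obs in observations:
    patients.setdefault(obs.patientid, []).append(obs)

print(observations[0].checkpoint)   # e.g. 1
print(observations[3].patientid)    # e.g. "0001"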

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use one big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
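A minimal PyTables sketch of that single compressed table (the string sizes and file name are assumptions):
import tables as tb

class Result(tb.IsDescription):
    stringArg1 = tb.StringCol(32)
    stringArg2 = tb.StringCol(32)
    stringArg3 = tb.StringCol(32)
    stringArg4 = tb.StringCol(32)
    intArg1 = tb.Int32Col()
    results = tb.Float64Col(shape=(12,))

with tb.open_file("results.h5", mode="w") as h5:
    filters = tb.Filters(complevel=5, complib="blosc")
    table = h5.create_table("/", "results", Result, filters=filters)
    row = table.row
    # one example entry
    row["stringArg1"], row["stringArg2"] = "a", "b"
    row["stringArg3"], row["stringArg4"] = "c", "d"
    row["intArg1"] = 42
    row["results"] = [0.0] * 12
    row.append()
    table.flush()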
Is there a reason the basic 6 table approach doesn't apply?
i.e. tables 1-5 would be single-column tables defining the valid values for each of the fields, and the final table would be a 5-column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string arguments as you describe, the 6th table could consist of just 3 columns (string1, string2, int1), and you could generate the combinations with string3 and string4 dynamically via a Cartesian join.
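A sketch of that fact table in SQLite (the lookup tables for the valid argument values would be created the same way; column names follow the question):
import sqlite3

float_cols = ", ".join("floatResult%02d REAL" % i for i in range(1, 13))
ddl = """
CREATE TABLE IF NOT EXISTS results (
    stringArg1 TEXT, stringArg2 TEXT, stringArg3 TEXT, stringArg4 TEXT,
    intArg1 INTEGER,
    %s,
    PRIMARY KEY (stringArg1, stringArg2, stringArg3, stringArg4, intArg1)
)
""" % float_cols

with sqlite3.connect("results.db") as conn:
    conn.execute(ddl)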
I'm not entirely sure what you're trying to do here, but it looks like you are trying to create a (potentially) sparse multidimensional array. So I won't go into details on solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
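For example, a structured array that stores only the entries that actually exist, saved and reloaded with NumPy's own format (field names and sizes are assumptions):
import numpy as np

dtype = [("s1", "U32"), ("s2", "U32"), ("s3", "U32"), ("s4", "U32"),
         ("i1", "i4"), ("results", "f8", (12,))]
table = np.zeros(3, dtype=dtype)                     # room for 3 example rows
table[0] = ("a", "b", "c", "d", 7, np.arange(12.0))  # one filled-in entry

np.save("results.npy", table)
reloaded = np.load("results.npy")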
