Risks of changing NaN responses to zero in a pandas DataFrame - python

I have a large-ish survey dataset to clean (300 columns, 30,000 rows) and the columns are mixed. I'm using Python with pandas and numpy, and am very much at the training-wheels stage with Python.
Some of the columns had Y or N answers to questions (and these are filled with "Y" or "N").
Some were Likert-scale questions with 5 possible answers. In the CSV file each answer (agree, disagree etc.) has its own column. These have imported as 1 for a yes and NaN otherwise.
Other questions had up to 10 possible answers (e.g. age bands) and these have imported as a string in one column, e.g. "a. 0-18" or "b. 19-25" and so on. Changing those will be interesting!
As I go through I'm changing the Y/N answers to 1 or 0. However, for the Likert-scale columns I'm concerned there might be a risk in doing the same thing. Does anyone have a view on whether it would be preferable to leave those as NaN for now? Gender is the same - there is a separate column for Males and one for Females, both populated with 1 for yes and NaN for no.
I'm intending to use Python for the data analysis/charting (will import matplotlib & seaborn). As this is new to me I'm guessing that changes I make now may have unintended consequences later!
Any guidance you can give would be much appreciated.
Thanks in advance.

If there aren't 0s that mean anything, it's fine to fill the NAs with a value (0 for convenience). It all depends on your data. That said, 300 x 30k isn't that big: save it off as a CSV and just experiment in a Jupyter/IPython notebook. pandas can probably read it in under a second, so if you screw anything up, just reload.
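For the simpler conversions the question describes - mapping Y/N answers to 1/0 and plugging the 1/NaN Likert indicator columns with 0 - a minimal sketch might look like this (the file name and column names are made up; substitute your own):

import pandas as pd

df = pd.read_csv('survey.csv')  # hypothetical file name

# Y/N answers -> 1/0
df['question_1'] = df['question_1'].map({'Y': 1, 'N': 0})

# Likert indicator columns imported as 1/NaN -> 1/0
likert_cols = ['strongly_agree', 'agree', 'neutral', 'disagree', 'strongly_disagree']
df[likert_cols] = df[likert_cols].fillna(0)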
Here's a quick bit of code that can condense a multi-column question set into a single numeric column:
import pandas as pd

df = pd.DataFrame({
    1: {'agree': 1},
    2: {'disagree': 1},
    3: {'whatevs': 1},
    4: {'whatevs': 1}}).transpose()
df

question_sets = {
    'set_1': ['disagree', 'whatevs', 'agree'],  # define these lists from 1 to whatever
}

for setname, setcols in question_sets.items():
    # plug the NaNs with 0
    df[setcols] = df[setcols].fillna(0)
    # scale each 0/1 column in the question set by an ascending value
    for val, col in enumerate(setcols, start=1):
        df[col] *= val
    # create the new column by summing all the question-set columns
    df[setname] = df[setcols].sum(axis=1)
    # delete the old columns
    df.drop(setcols, inplace=True, axis=1)
df

Related

Change one row in a pandas dataframe based on the value of another row

I have a pandas DataFrame with data from an ice cream freezer. Several columns describe the different temperatures in the system as well as some other things.
One column, named 'Defrost status', tells me with boolean values when the freezer was defrosting to remove excess ice.
Those 'defrosts' are what I am interested in, so I added another column named "around_defrost". This column currently only has NaN values, but I want to change them to 'True' whenever there is a defrost within 30 minutes of that specific row in the dataframe.
The data is recorded every minute, so 30 minutes means the 30 rows before a defrost and the 30 rows after it need to be set to 'True'.
I have tried to do this with iterrows, itertuples and by playing with the indexes, but no success so far. If anyone has a good idea of how this could be done, I'd really appreciate it!
You need to use dataframe.rolling:
df = df.sort_values("Time")  # sort by Time
df['around_defrost'] = df['Defrost status'].rolling(60, center=True, min_periods=0).apply(
    lambda x: True if True in x else False, raw=True)
EDIT: you may need rolling(61, center=True) since you want to consider the row in question AND 30 before and after.
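For reference, a sketch of the same idea with the 61-row centred window using a rolling max instead of apply (this is a variant, not the answer above; it assumes df is already sorted by Time and that 'Defrost status' is boolean):

# Any defrost within the row itself plus 30 rows either side makes the window max 1.
df['around_defrost'] = (
    df['Defrost status']
    .astype(int)
    .rolling(61, center=True, min_periods=1)
    .max()
    .astype(bool)
)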

How to create a customised data frame based on column values in python?

I have an initial dummy dataframe with 7 columns, 1 row, given column names and initialised to zeros:
import numpy as np
import pandas as pd

d = pd.DataFrame(np.zeros((1, 7)))
d = d.rename(columns={0: "Gender_M",
                      1: "Gender_F",
                      2: "Employed_Self",
                      3: "Employed_Employee",
                      4: "Married_Y",
                      5: "Married_N",
                      6: "Salary"})
Now I have a single record
data = [['M', 'Employee', 'Y',85412]]
data_test = pd.DataFrame(data, columns = ['Gender', 'Employed', 'Married','Salary'])
From the single record I have to create a new dataframe where, if the
Gender column has M, then Gender_M should be changed to 1 and Gender_F left at zero;
if the Employed column has Employee, then Employed_Employee should be changed to 1 and Employed_Self left at zero;
the same for Married; and for the integer column Salary, just set the value 85412. I tried this with if statements, but it's a long block of code - is there a simpler way?
Here is one way, using update twice (df below is the question's data_test, taken as a copy so the original is untouched):
df = data_test.copy()
d.update(df)                                              # copies Salary across (the only shared column)
df.columns = df.columns + '_' + df.astype(str).iloc[0]    # e.g. Gender -> Gender_M, Employed -> Employed_Employee
df.iloc[:] = 1                                            # turn every value into an indicator of 1
d.update(df)                                              # sets Gender_M, Employed_Employee and Married_Y to 1
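An alternative sketch of the same idea (not the answer above) uses pd.get_dummies to build the indicator columns and then reindexes against the template's columns; it assumes d and data_test from the question, and that Salary is the only non-categorical field:

import pandas as pd

dummies = pd.get_dummies(data_test[['Gender', 'Employed', 'Married']])  # Gender_M, Employed_Employee, Married_Y
result = dummies.reindex(columns=d.columns, fill_value=0)               # add the missing dummy columns as 0
result['Salary'] = data_test['Salary']                                  # carry the numeric column across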
Alas, homework is often designed to be boring and repetitive ...
You do not have a problem - rather, you want other people to do the work for you. SO is not for that purpose - post a problem and you will find many people willing to help.
So show your FULL answer, then ask "Is there a better way?"

What is the most efficient way to dedupe a Pandas dataframe that has typos?

I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might have typos, even though they are still duplicates. For example, suppose I had this dataframe:
index   name         zipcode
-----   ----------   -------
0       john doe     12345
1       jane smith   54321
2       john dooe    12345
3       jane smtih   54321
The typos could occur in either name or zipcode, but let's just worry about the name one for this question. Obviously 0 and 2 are duplicates as are 1 and 3. But what is the computationally most efficient way to figure this out?
I have been using the Levenshtein distance to calculate the distance between two strings from the fuzzywuzzy package, which works great when the dataframe is small and I can iterate through it via:
from fuzzywuzzy import fuzz

for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        ratio = fuzz.partial_ratio(row['name'], row2['name'])
        if ratio > 90:  # A good threshold for single-character typos on names
            pass  # Do something to declare a match and throw out the duplicate
Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long. And obviously this gets worse if I also need to dedupe potential typos in the zipcode too. Yes, I could do this with .itertuples(), which would give me a factor of ~100 speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?
Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning and I don't have any nor am I under the impression that this package will handle unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not have to go that far if there is an existing, better approach.
The package pandas-dedupe can help you with your task.
pandas-dedupe works as follows: first it asks you to label a bunch of records it is most confused about; afterwards, it uses this knowledge to resolve duplicate entities. And that is it :)
You can try the following:
import pandas as pd
from pandas_dedupe import dedupe_dataframe
df = pd.DataFrame.from_dict({'name':['john', 'mark', 'frank', 'jon', 'john'], 'zip':['11', '22', '33', '11', '11']})
dd = dedupe_dataframe(df, ['name', 'zip'], canonicalize=True, sample_size=1)
The console will then ask you to label some examples.
If a pair is a duplicate, type 'y', otherwise 'n'; once done, type 'f' for finished.
It will then perform deduplication on the entire dataframe.
The string-grouper package is perfect for this. It uses TF-IDF over N-grams underneath and is much faster than Levenshtein-based comparison.
from typing import Dict, List

import pandas as pd
from string_grouper import group_similar_strings

def group_strings(strings: List[str]) -> Dict[str, str]:
    series = group_similar_strings(pd.Series(strings))

    name_to_canonical = {}
    for i, s in enumerate(strings):
        deduped = series[i]
        if s != deduped:
            name_to_canonical[s] = deduped

    return name_to_canonical
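Hypothetical usage of the helper above on the question's name column (the returned dict maps each misspelled variant to its canonical string, so it can be fed straight to replace before dropping duplicates):

typo_map = group_strings(df['name'].tolist())
df['name'] = df['name'].replace(typo_map)              # normalise the typos
df = df.drop_duplicates(subset=['name', 'zipcode'])    # then dedupe as usual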
For zipcodes, I can fairly confidently state that you can't detect typos without some mechanism for field validation (two zipcodes could look very close and both be valid zipcodes)
If the data is sorted, with some assumptions about where the typo is made (First letter is highly unlikely except in cases of common substitutions) you might be able to take advantage of that and search them as distinct per-letter chunks. If you assume the same for the last name, you can divide them into 26^2 distinct subgroups and only have them search within their field.
You could also try an approach just looking at the set of ORIGINAL first names and last names. If you're searching 7 million items and you have 60 thousand "Johns", you only need to compare them once against "Jhon" to find the error, then search for the "Jhon" and remove or fix it. But this is assuming, once again, that you break this up into a first-name and last-name series within the frame (using pandas' str.extract() with "([\w]+) ([\w]+)" or some such as your regex, as the data demands).
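A rough sketch of the blocking idea described above: split names into first/last with str.extract, bucket rows by the first letter of each part, and only run the expensive fuzzy comparison inside each bucket. The threshold and column names are illustrative, not a drop-in solution.

import pandas as pd
from fuzzywuzzy import fuzz

def find_candidate_pairs(df: pd.DataFrame, threshold: int = 90):
    parts = df['name'].str.extract(r'([\w]+) ([\w]+)')
    first_initial = parts[0].str[0]
    last_initial = parts[1].str[0]

    pairs = []
    # 26^2 much smaller groups instead of one giant O(n^2) pass
    for _, block in df.groupby([first_initial, last_initial]):
        names = block['name']
        idx = names.index
        for i in range(len(idx)):
            for j in range(i + 1, len(idx)):
                if fuzz.partial_ratio(names.iloc[i], names.iloc[j]) > threshold:
                    pairs.append((idx[i], idx[j]))
    return pairs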

Comparing data in 2 separate DataFrames and producing a result in Python/Pandas

I am new to Python and I'm trying to reproduce something like Excel's INDEX/MATCH lookup with Python and pandas, though I'm struggling to get it working.
Basically, I have 2 separate DataFrames:
The first DataFrame ('market') has 7 columns, though I only need 3 of them for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.
The second DataFrame ('transactions') has 14 columns, though I only need 2 of them ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.
My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.
I have achieved it with iterrows() but because of the size of the dataset, it returns a single result every 3 minutes - which means I would have to run the script for 1,190 hours to get the final result.
Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:
ValueError: too many values to unpack (expected 2)
This is the code I have written (which currently produces the above ValueError):
for i_symbol, acceptance_date in transactions.itertuples(index=False):
    for symbol, date in market.itertuples(index=False):
        if i_symbol == symbol and acceptance_date == date:
            print(market.symbol + market.date + market.close)
2 questions:
Is itertuples() the best/fastest approach? If so, how can I get the above working?
Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?
Thanks, Matt
Regarding question 1: DataFrame.itertuples() yields one namedtuple for each row. You can either unpack these like standard tuples or access the tuple elements by name:
for t in transactions.itertuples(index=False):
    for m in market.itertuples(index=False):
        if t.i_symbol == m.symbol and t.acceptance_date == m.date:
            print(m.symbol + m.date + m.close)
(I did not test this with data frames of your size but I'm pretty sure it's still painfully slow)
Regarding question 2: You can simply merge both data frames on symbol and date.
Rename your "transactions" DataFrame so that it also has columns named "symbol" and "date":
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']
Then merge both DataFrames on symbol and date:
result = pd.merge(market, transactions, on=['symbol','date'])
The result DataFrame consists of one row for each symbol/date combination which exists in both DataFrames. The operation only takes a few seconds on my machine with DataFrames of your size.
@Parfait provided the best answer below as a comment. Very clean, worked incredibly fast - thank you.
pd.merge(market[['symbol', 'date', 'close']],
         transactions[['i_symbol', 'acceptance_date']],
         left_on=['symbol', 'date'],
         right_on=['i_symbol', 'acceptance_date'])
No need for looping.

How can I efficiently run groupby() over subsets of a dataframe to avoid MemoryError

I have a decent-sized dataframe (roughly: df.shape = (4000000, 2000)) that I want to use .groupby().max() on. Unfortunately, neither my laptop nor the server I have access to can do this without throwing a MemoryError (the laptop has 16 GB of RAM, the server has 64 GB). This is likely due to the datatypes in a lot of the columns. For now I'm considering those fixed and immutable (many, many dates, large integers, etc.), though perhaps that could be part of the solution.
The command I would like is simply new_df = df.groupby('KEY').max().
What is the most efficient way to break down this problem to prevent running into memory problems? Some things I've tried, to varying success:
Break the df into subsets, run .groupby().max() on those subsets, then concatenate. Issue: the size of the full df can vary and is likely to grow; I'm not sure of the best way to break the df apart so that the subsets are definitely not going to throw the MemoryError. (A sketch of this approach is included after the sample data below.)
Include only a subset of columns on which to run the .groupby() in a new df, then merge this with the original. Issue: the number of columns in this subset can vary (much smaller or larger than the current set), though the names of the columns all include the prefix ind_.
Look for out-of-memory management tools. As of yet, I've not found anything useful.
Thanks for any insight or help you can provide.
EDIT TO ADD INFO
The data is for a predictive modeling exercise, and the number of columns stems from making binary variables from a column of discrete (non-continuous/categorical) values. The df will grow in column size if another column of values goes through the same process. Also, the data is originally pulled from a SQL query; the set of items the query finds is likely to grow over time, meaning the number of rows will grow and, since the number of distinct values in one or more columns may also grow, so will the number of columns after making the binary indicator variables. The data pulled goes through extensive transformation and gets merged with other datasets, making it implausible to run the grouping in the database itself.
There are repeated observations by KEY which have the same values except for the column I turn into indicators (this is just to show shape/value samples; the actual df has dates, integers 16 digits or longer, etc.):
KEY   Col1   Col2    Col3
A     1      blue    car
A     1      blue    bike
A     1      blue    train
B     2      green   car
B     2      green   plane
B     2      green   bike
This should become, with dummied Col3:
KEY   Col1   Col2    ind_car   ind_bike   ind_train   ind_plane
A     1      blue    1         1          1           0
B     2      green   1         1          0           1
So the .groupby('KEY') gets me the groups, and the .max() gets the new indicator columns with the right values. I know the .max() step might be bogged down by computing a "max" for string or date columns.
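As a sketch of the chunk-and-recombine idea from the list above: because max is taken per column and is associative, you can group each chunk, concatenate the partial results, and group once more. The chunk size is illustrative and should be tuned to available memory; df and KEY are the frame and key column from the question.

import pandas as pd

def chunked_groupby_max(df: pd.DataFrame, key: str = 'KEY', chunk_rows: int = 500_000) -> pd.DataFrame:
    partials = []
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        partials.append(chunk.groupby(key).max())
    # groups split across chunk boundaries are merged by this second pass
    return pd.concat(partials).groupby(level=0).max()

new_df = chunked_groupby_max(df)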
