NumPy, pandas merge rows - Python

I'm working with NumPy and pandas and need to "merge" values. I have a column marital-status that contains things like this:
'Never-married', 'Divorced', 'Separated', 'Widowed'
and:
'Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse'
I'm wondering how to merge them into just two categories: the first four into single and the second group into in relationship. I need it for one-hot encoding later.
As for the sample output, the marital-status column should contain just single or in relationship, according to what I mentioned above.

You can use pd.Series.map to convert certain values to others. For this you need a dictionary that assigns each existing value a new value. Values not present in the dictionary will be replaced with NaN.
married_map = {
    status: 'Single'
    for status in ['Never-married', 'Divorced', 'Separated', 'Widowed']}
married_map.update({
    status: 'In-relationship'
    for status in ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']})
df['marital-status'].map(married_map)
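Since you mention one-hot encoding later, a minimal follow-up sketch (assuming the dataframe is called df and the column 'marital-status', as above) would assign the mapped result back and then let pandas build the dummy columns:
# write the two merged categories back into the column (sketch)
df['marital-status'] = df['marital-status'].map(married_map)
# one-hot encode the merged categories, e.g. with pandas
one_hot = pd.get_dummies(df['marital-status'], prefix='marital-status')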

Related

How to read SPSS (aka .sav) files in Python

It's my first time using Jupyter Notebook to analyze survey data (a .sav file), and I would like to read it in a way that shows the metadata, so I can connect the answers with the questions. I'm a total newbie in this field, so any help is appreciated!
import pandas as pd
import pyreadstat

# df holds the data, meta holds the file's metadata
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav')
type(df)
type(meta)
df.head()
Please let me know if there is an additional step needed for me to be able to see the metadata!
The meta object contains the metadata you are looking for. Probably the most useful attributes to look at are:
meta.column_names_to_labels : a dictionary mapping the column names you have in your pandas dataframe to labels, i.e. longer explanations of the meaning of each column
print(meta.column_names_to_labels)
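As a small sketch (assuming you want the longer labels as column headers), that dictionary can also be used to rename the dataframe columns:
# rename the terse column names to their longer labels (sketch)
df_labeled = df.rename(columns=meta.column_names_to_labels)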
meta.variable_value_labels : a dict whose keys are column names and whose values are themselves dicts mapping the values you find in your dataframe to their value labels.
print(meta.variable_value_labels)
For instance, if you have a column "gender" with values 1 and 2, you could get:
{"gender": {1:"male", 2:"female"}}
which means value 1 is male and 2 female.
You can get those labels from the beginning if you pass the argument apply_value_formats:
df, meta = pyreadstat.read_sav('survey.sav', apply_value_formats=True)
You can also apply those value formats to your dataframe at any time with pyreadstat.set_value_labels, which returns a copy of your dataframe with the labels applied:
df_copy = pyreadstat.set_value_labels(df, meta)
meta.missing_ranges : gives you the ranges of user-defined missing values. Let's say that for a certain variable the survey encoded 1 as yes, 2 as no, and then used missing values: 5 meaning "didn't answer" and 6 meaning "person not at home". When you read the dataframe, by default you will get values 1 and 2 and NaN (missing) instead of 5 and 6. You can pass the argument user_missing to get the 5s and 6s, and meta.missing_ranges will tell you that 5 and 6 are missing values. meta.variable_value_labels will give you the "didn't answer" and "person not at home" labels.
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True)
print(meta.missing_ranges)
print(meta.variable_value_labels)
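If it helps, the two arguments can be combined in a single read (a sketch, assuming a file called survey.sav), so you keep the user-defined missing codes and get the text labels at the same time:
# keep user-defined missing codes and apply value labels in one read (sketch)
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True, apply_value_formats=True)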
These are the pieces of information potentially useful for your case; not all of them will necessarily be present in your dataset.
More information here: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html

Set value in pandas not knowing if column and row exist, avoiding NaN

I am building a shallow array in pandas containing pairs of values (concept - document):
          doc1  doc2
concept1     1     0
concept2     0     1
concept3     1     0
I parse an XML file and get pairs (concept - doc), and every time a new pair comes in I add it to the dataframe.
Since an incoming pair might or might not contain values already present in the rows and/or columns (a new concept or a new document), I use the following code:
import numpy as np
import pandas as pd

onp = np.arange(1, 21, 1).reshape(4, 5)
oindex = ['concept1', 'concept2', 'concept3', 'concept4']
ohead = ['doc1', 'doc2', 'doc3', 'doc5', 'doc6']
data = onp
mydf = pd.DataFrame(data, index=oindex, columns=ohead)
# ... loop ...
mydf.loc['conceptXX', 'ep8'] = 1
It works well, except that the value stored in the dataframe is 1.0 and not 1, and whenever a new row and/or column is added the rest of its values are NaN. How can I avoid that? All the added values should be 0 or 1. Note: the intention is to also have some columns for other calculations, so I cannot simply convert the whole dataframe to one type, for instance with:
mydf = mydf.astype(object)
Thanks.
SECOND EDIT AFTER ALollz COMMENT
More explanation of the real problem.
I have an XML file that gives me the data in the following way:
<names>
  <name>michael
    <documents>
      <document>doc1</document>
      <document>doc2</document>
    </documents>
  </name>
  <name>mathieu
    <documents>
      <document>doc1</document>
      <document>docN</document>
    </documents>
  </name>
</names>
...
I want to pass this data to a dataframe to make calculations. Basically there are names that appear in different documents. I parse the XML with:
import xml.etree.ElementTree as ET

tree = ET.parse(myinputFile)
root = tree.getroot()
I am adding the new values into the dataframe one by one.
Sometimes the name being added is already present in the dataframe but a new doc has to be added, and vice versa.
I hope that clarifies it a bit.
I was about to write this as the solution:
mydf.fillna(0, inplace=True)
mydf = mydf.astype(int)
changing all the NaN values to 0 and then converting them to int to avoid floats.
That has a downside, though: I might also want some columns with text data, and in that case an error occurs.
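One way around that (a minimal sketch, assuming the document columns are the only numeric ones) is to fill and convert only the numeric columns and leave any text columns untouched:
# fill NaN and cast to int only for the numeric (document) columns; text columns are left alone (sketch)
num_cols = mydf.select_dtypes(include='number').columns
mydf[num_cols] = mydf[num_cols].fillna(0).astype(int)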

Alternative to drop_duplicates in Python 3.6

I am working with a huge volume of data, around 50 million rows.
I want to find the unique combinations of values from multiple columns. I use the script below:
dataAll[['Frequency', 'Period', 'Date']].drop_duplicates()
But this takes a long time, more than 40 minutes.
I found an alternative:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
but the script above returns an array, while I need a dataframe like the first script gives.
Generally your new code cannot be converted to a DataFrame, because:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
creates one big 1d numpy array, so after removing duplicates it is impossible to recreate the rows.
E.g. if there are 2 unique values, 3 and 1, it is impossible to tell which dates belong to 3 and which to 1.
But if there is only one unique value of Frequency and the Date can be found for each Period, as in your sample, a solution is possible.
EDIT:
One possible alternative is to use dask.dataframe.DataFrame.drop_duplicates.
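A rough sketch of that alternative (assuming the data already lives in the pandas dataframe dataAll; the number of partitions is a guess to tune):
import dask.dataframe as dd

# partition the pandas frame so drop_duplicates can run in parallel (sketch)
ddf = dd.from_pandas(dataAll, npartitions=8)
unique_rows = ddf[['Frequency', 'Period', 'Date']].drop_duplicates().compute()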

Using NumPy Vectorization to Create Column Containing Length of Another Column

I think I have a pretty straightforward question here. Essentially I have a table with one column where each row contains a set of values that had previously been converted from a JSON string.
For example, here is one cell value for the column "options":
[u'Tide Liquid with a Touch of Downy April Fresh 69oz', u'Tide Liquid with Febreze Spring & Renewal 69oz (HE or Non-HE)', u'Tide Liquid HE with Febreze Sport 69oz', u'Tide Liquid HE Clean Breeze 75oz', u'Tide Liquid Original 75oz', u'Other']
I want to add a new column that simply counts the number of values in this list. I can do this row by row with code like this:
df['num_choices'] = len(df.loc[row_num,'options'])
(i.e. I want to count the number of values in the column "options" and return that count in a new column called "num_choices")
Running this on the provided example above (with the input row#) will create a new column next to it with the value 6, since the count of options is 6.
How can I do this systematically for all 5,000 of my rows?
I tried to do this with the pandas iterrows() function, but I've been told that would be way less efficient than simply using NumPy ndarray vectorization. But I can't seem to figure out how to do that.
Thanks so much for your help!
As mentioned in the comments above, there's not really any way to vectorize operations on arrays that contain arbitrary Python objects.
I don't think you can do much better than using a simple for loop or list comprehension, e.g.:
df['num_choices'] = np.array([len(row) for row in df.options])
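Equivalent one-liners that stay in pandas (hedged: both still loop over the Python lists under the hood rather than truly vectorizing) are Series.apply and Series.str.len:
# both iterate over the list objects internally; neither is a true vectorized operation (sketch)
df['num_choices'] = df['options'].apply(len)
# or, since .str.len() also works on lists stored in an object column
df['num_choices'] = df['options'].str.len()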

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, which has r rows and c columns, one of which is a key. There may be repeats of the key.
I want to "match" that key against another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev., variance, etc.) and then determine whether the corresponding element in A falls within that range.
You're absolutely right that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's a better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
   SEQ  VAL   KEY
0    1   28  keyA
1    2   15  keyB
2    3   13  keyC
3    4   11  keyD
4    1   12  keyA
5    2   23  keyB
6    3   21  keyC
7    4   15  keyD
Both DFs, A and B, are of this format.
I can iteratively get the resulting sets with:
loop_iter = len(A) // max(A['SEQ_NUM'])  # integer division so range() gets an int
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1) how to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) how to match a DF and do summary statistics, etc., on the rows of a DF that match on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its simplest is that you have a pd.Series of values (i.e. a["key"], which we'll just call keys) that correspond to rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.

    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass

# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]

# this groups by the key value and applies the descriptive func to each sub-df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really just similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary is a DataFrame whose rows are the returned Series; and if the applied function returns a DataFrame, the result is a DataFrame with a MultiIndex. This is a core pattern in pandas, and there's a whole, whole lot to explore here.
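As a concrete sketch of the min/max use case from the question (hedged: the column names VAL and KEY come from the sample above, and the "in range" check is one possible reading), the per-key ranges can be computed from B and merged back onto A:
# per-key summary of B: min, max and standard deviation of VAL (sketch, using the sample's column names)
ranges = B.groupby('KEY')['VAL'].agg(['min', 'max', 'std'])

# attach each row of A to its matching range and flag whether its VAL falls inside it
checked = A.merge(ranges, left_on='KEY', right_index=True, how='left')
checked['in_range'] = checked['VAL'].between(checked['min'], checked['max'])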
