Pandas dataframe groupby value_counts - python

I tried this code and it works perfectly, but when I remove "RIAGENDR" from
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
it raises an error. What is the reason?
Please help me with that.
dm=ds[ds["RIAGENDR"]=="male"]
dm.RIDAGEYR=pd.cut(dm.RIDAGEYR,[18,30,40,50,60,70,80,100])
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
dx=dx.value_counts()
dx=dx.unstack()
dx = dx.apply(lambda x: x/x.sum(), axis=0)
#dx=dx.to_string(float_format="%.3f")
dx

You should not use [] when you have a single variable. The list form is meant for several variables, as in:
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
Alternatively, one way is to pass the key through the by argument:
dx=dm.groupby(by=["RIAGENDR"])
You can read about this in the groupby documentation, which also covers hierarchical indexes:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

You should remove the [] brackets from within the groupby operation.
To be clear, if you want to group by 2 variables you should use the below code which uses a list of variables:
dx=dm.groupby(["RIAGENDR","RIDAGEYR"])["DMDMARTL"]
If you want to group by 1 variable, you should not use a list of variables, just a single one:
dx=dm.groupby("RIAGENDR")["DMDMARTL"]
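For reference, here is a minimal sketch of the single-key version of the whole pipeline. The data is made up, and the agegrp column and its labels are names I invented for the example. The age bins are stored as plain strings to keep the grouping key simple.

```python
import pandas as pd

# Made-up sample standing in for the filtered survey data in the question.
dm = pd.DataFrame({
    "RIAGENDR": ["male"] * 6,
    "RIDAGEYR": [22, 25, 35, 38, 45, 47],
    "DMDMARTL": ["single", "married", "married", "married", "single", "married"],
})

# Bin the ages; "agegrp" and the labels are hypothetical names for this sketch.
dm["agegrp"] = pd.cut(
    dm["RIDAGEYR"], [18, 30, 40, 50], labels=["18-30", "30-40", "40-50"]
).astype(str)

# Group by ONE key, passed as a plain string rather than a one-element list.
counts = dm.groupby("agegrp")["DMDMARTL"].value_counts().unstack(fill_value=0)

# Same normalisation as in the question: each column is divided by its own sum.
proportions = counts / counts.sum()
print(proportions)
```

Because every row in dm has the same RIAGENDR value after filtering, grouping by the age bin alone carries the same information as grouping by both columns.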

Related

Remove unwanted characters from Dataframe values in Pandas

I have the following Dataframe full of locus/gen names from a multiple genome alignment.
However, I am trying to get only a full list of the locus/name without the coordinates.
   Tuberculosis_locus  Smagmatis_locus                  H37RA_locus            Bovis_locus
0  0:Rv0001:1-1524     1:MSMEG_RS33460:6986600-6988114  2:MRA_RS00005:1-1524   3:BQ2027_RS00005:1-1524
1  0:Rv0002:2052-3260  1:MSMEG_RS00005:499-1692         2:MRA_RS00010:2052-3260  3:BQ2027_RS00010:2052-3260
2  0:Rv0003:3280-4437  1:MSMEG_RS00015:2624-3778        2:MRA_RS00015:3280-4437  3:BQ2027_RS00015:3280-4437
To avoid issues with empty cells, I am filling cells with 'N/A' and then stripping the unwanted characters. But it gives exactly the same result; nothing seems to be happening.
for value in orthologs['Tuberculosis_locus']:
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d+'))
Any idea on what I am doing wrong? I'd like the following output:
   Tuberculosis_locus  Smagmatis_locus  H37RA_locus  Bovis_locus
0  Rv0001              MSMEG_RS33460    MRA_RS00005  BQ2027_RS00005
1  Rv0002              MSMEG_RS00005    MRA_RS00010  BQ2027_RS00010
2  Rv0003              MSMEG_RS00015    MRA_RS00015  BQ2027_RS00015
Split by : with a maximum of two splits and then take the second element, e.g.:
df.applymap(lambda v: v.split(':', 2)[1])
def clean(x):
    x = x.split(':')[1].strip()
    return x
orthologs = orthologs.applymap(clean)
should work.
Explanation:
applymap applies a function to every element of the whole dataframe, whereas apply works on one column (or row) at a time. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.
clean is the function you want to apply to every entry of the dataframe. Note that you pass the function itself (clean), not a call such as clean(x), when you use it together with applymap or apply.
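As for why the original attempt did nothing: str.lstrip and str.rstrip take a set of characters to remove, not a regular expression. The pattern '\d:' therefore means "a backslash, the letter d, or a colon", and none of those sits at either end of the values. The sketch below shows this and then applies the split approach. The sample rows are a shortened, made-up copy of the table in the question, and the handling of missing cells is my own assumption about what is wanted.

```python
import pandas as pd

# Made-up sample in the same "index:locus:coordinates" shape as the question.
orthologs = pd.DataFrame({
    "Tuberculosis_locus": ["0:Rv0001:1-1524", "0:Rv0002:2052-3260", None],
    "Smagmatis_locus": [
        "1:MSMEG_RS33460:6986600-6988114",
        "1:MSMEG_RS00005:499-1692",
        "1:MSMEG_RS00015:2624-3778",
    ],
})

# lstrip/rstrip strip a set of characters, so r"\d:" means
# "backslash, d or colon" and the value comes back unchanged.
unchanged = "0:Rv0001:1-1524".lstrip(r"\d:").rstrip(r":\d+")

def clean(value):
    """Return the middle field, or 'N/A' for a missing cell."""
    if pd.isna(value):
        return "N/A"
    return value.split(":")[1].strip()

# DataFrame.apply hands over one column at a time;
# Series.map then runs clean on each element of that column.
cleaned = orthologs.apply(lambda column: column.map(clean))
print(cleaned)
```

Going through apply plus Series.map avoids the applymap / DataFrame.map naming difference between pandas versions.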

How to separate tuple into independent pandas columns?

I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the below code but don't quite get the right output - every row ends up with the same exact name/score in the new columns.
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = (matched.key[i][0][0])
    matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]
First, when going over rows in pandas it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
In your loop, each assignment sets the whole column to a single value, so every row ends up with the name and score from the last iteration. If you want to keep the loop, assign to one row at a time (this assumes the default 0..n-1 index):
for i, v in enumerate(matched.key):
    matched.loc[i, 'MatchedNameFinal'] = v[0][0]
    matched.loc[i, 'MatchedNameScore'] = v[0][1]
Generally, you want to avoid looping with enumerate in pandas, because pandas functions are vectorized and much faster to execute.
So this solution does not iterate using enumerate.
First, turn each list into a single tuple per row:
matched.key.explode()
Then use zip with * to split the tuples into two columns:
matched['col1'], matched['col2'] = zip(*tuples)
Do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
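Here is a small self-contained sketch of the explode idea, in case it helps to see it run. The sample names and scores are made up, and I am assuming each key holds a list with exactly one (name, score) tuple, as the key[i][0][0] indexing in the question suggests.

```python
import pandas as pd

# Made-up sample: each key is a list holding one (name, score) tuple.
matched = pd.DataFrame({
    "consumer_name_first": ["May", "Jon"],
    "key": [[("May", 0.99)], [("John", 0.87)]],
})

# explode unwraps the one-element lists, leaving one tuple per row.
exploded = matched["key"].explode()

# Build the two new columns from the tuples, keeping the original index.
parts = pd.DataFrame(
    exploded.tolist(),
    index=exploded.index,
    columns=["MatchedNameFinal", "MatchedNameScore"],
)

# join aligns on the index, so each row gets its own name and score.
result = matched.join(parts)
print(result[["consumer_name_first", "MatchedNameFinal", "MatchedNameScore"]])
```

If a key could hold several tuples, explode would produce several rows per original row, and the join would then repeat the original row for each of them.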

How to assign objects to a column using the .between function

I am trying to label a data frame as on-peak, mid-peak, off-peak, etc. I managed to select the values I want to assign 'Mid-Peak' to with df['Peak'][df['func'] == 'Winter_Weekend']. However, when I add .between_time I get the error: SyntaxError: can't assign to function call. I am not sure how to fix this. My goal is for the code to work like this. Do I need another function, or do I need to change the syntax? Thank you for the help.
df['Peak'][df['func'] == 'Winter_Weekend'].between_time('16:00','21:00', include_end=False) = 'Mid-Peak'
In general, you can't assign a result to a function call, so you need a different syntax. You could try
selection = df[df['func'] == 'Winter_Weekend'].between_time('16:00','21:00', include_end=False)
selection["Peak"] = "Mid-Peak"
But this doesn't update your original df, only the rows copied into selection.
To update the original dataframe, one way is to use loc to select both rows and a column, and .index to apply the between_time selection to the original dataframe:
ww = df["func"] == "Winter_Weekend"
df.loc[df[ww].between_time('16:00', '21:00', include_end=False).index, "Peak"] = "Mid-Peak"
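I would recommend leveraging np.where() here. It needs a boolean mask with one entry per row of df, not the filtered dataframe itself, so build the mask first:
mask = (df['func'] == 'Winter_Weekend') & df.index.isin(df.between_time('16:00', '21:00', include_end=False).index)
df['Peak'] = np.where(mask, 'Mid-Peak', df['Peak'])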
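A runnable sketch of the loc approach, on made-up data. One assumption to flag: I use inclusive="left" here instead of include_end=False, because as far as I know include_end was removed in pandas 2.0 and inclusive is available from pandas 1.4 onward. On older versions, keep include_end=False.

```python
import pandas as pd

# Made-up hourly timestamps from 14:00 to 21:00 on one day.
idx = pd.date_range("2021-01-02 14:00", "2021-01-02 21:00", periods=8)
df = pd.DataFrame(
    {
        "func": [
            "Winter_Weekend", "Winter_Weekday", "Winter_Weekend", "Winter_Weekday",
            "Winter_Weekend", "Winter_Weekday", "Winter_Weekday", "Winter_Weekend",
        ],
        "Peak": "Off-Peak",
    },
    index=idx,
)

# Rows with the right label, restricted to 16:00 <= time < 21:00.
is_weekend = df["func"] == "Winter_Weekend"
in_window = df[is_weekend].between_time("16:00", "21:00", inclusive="left").index

# Assign through loc on the original dataframe, so df itself is updated.
df.loc[in_window, "Peak"] = "Mid-Peak"
print(df)
```

Only the 16:00 and 18:00 rows change: the 21:00 row has the right label but falls on the excluded end of the window.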

Add an incrementing number to end of string in python

I am trying to add an incrementing number to the end of a string and apply this to every row in one column of my dataframe. The column has 76 rows. Ideally I want the output to look like this:
Event Reference Number
FEBRUARY_2019_1
FEBRUARY_2019_2
FEBRUARY_2019_3
I am trying to use apply with a lambda function but am stuck trying to figure out the next step.
df_Trade_File['Event Reference Number'].apply(lambda x: enumerate(df_Trade_File['Event Reference Number']))
The first thing to note is that for the change to take effect you need to assign the result back to the column. You could use apply to format it, but in this situation you can probably do something simpler like:
df_Trade_File['Event Reference Number'] = df_Trade_File['Event Reference Number'] + '_' + df_Trade_File.index.astype(str)
This assumes your index runs from 1 to 76. The default index starts at 0, so in that case use (df_Trade_File.index + 1).astype(str) instead.
Try using Series.str.cat
df["Event Reference Number"] = df["Event Reference Number"].str.cat(map(str, df.index), sep='_')
If you're using an index other than the standard one, replace map(str, df.index) with a range, but convert it to strings or Python will complain:
map(str, range(1, df.shape[0]+1))
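A minimal sketch that does not depend on what the index looks like, using three made-up rows instead of 76. The suffix name is just something I picked for the example.

```python
import pandas as pd

# Made-up sample: three rows that all start with the same reference text.
df = pd.DataFrame({"Event Reference Number": ["FEBRUARY_2019"] * 3})

# Counter 1..n as strings, aligned to whatever index df has.
suffix = pd.Series(range(1, len(df) + 1), index=df.index).astype(str)

# Element-wise string concatenation, assigned back to the column.
df["Event Reference Number"] = df["Event Reference Number"] + "_" + suffix
print(df)
```

Building the counter from len(df) rather than from the index means it still runs 1, 2, 3, ... after filtering or sorting.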

Tricky str value replacement within PANDAS DataFrame

Problem Overview:
I am attempting to clean stock data loaded from CSV file into Pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, PANDAS ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The PANDAS documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test, by the way, since both branches did the same thing):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        # write through .loc so the dataframe itself is updated
        # (assumes the default 0..n-1 index)
        all_exchanges.loc[i, 'MarketCapSym'] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows with iterrows() (iterating over the dataframe directly yields column names, not rows):
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values and pandas. I suggest using fillna() and assigning the result back to the column:
df['MarketCap'] = df['MarketCap'].fillna('M')
df['MarketCapSym'] = df['MarketCapSym'].fillna('M')
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Build a boolean mask from the MarketCap column and perform the replacement as follows.
# rows whose MarketCap contains 'M' or 'B' (missing values count as no match)
mask = all_exchanges['MarketCap'].str.contains('M|B', na=False)
# overwrites any data in the MarketCapSym column
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
# only replaces 'n/a'
all_exchanges.loc[mask, 'MarketCapSym'] = all_exchanges.loc[mask, 'MarketCapSym'].replace('n/a', 'M')
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line (plus import re):
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Take the values for the new column from 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() to each element, removing every character that matches [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
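For comparison, the same substitution can probably be written with the vectorized string methods instead of a list comprehension. This is only a sketch on made-up values, reusing the same [$.0-9] pattern.

```python
import pandas as pd

# Made-up sample values in the style described in the question.
all_exchanges = pd.DataFrame({"MarketCap": ["$1.2B", "$350.5M", "n/a"]})

# Remove dollar signs, dots and digits from every value in one call.
all_exchanges["MarketCapSymbol"] = all_exchanges["MarketCap"].str.replace(
    r"[$.0-9]", "", regex=True
)
print(all_exchanges)
```

Values without any of those characters, such as 'n/a', pass through unchanged, which matches what re.sub does in the list comprehension.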
