Removing special characters from column headers [duplicate] - python

This question already has answers here:
How to flatten a hierarchical index in columns
(19 answers)
Closed 1 year ago.
I used to_flat_index() to flatten the columns and ended up with column names like ('Method', 'sum'). I am trying to remove the special characters from these, but when I do, all the column names change to nan.
Code attempted:
df_pred.columns = df_pred.columns.str.replace("[(,),']", '')
Expected outcome: MethodSum

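It seems your columns were a MultiIndex, since you used to_flat_index(). After flattening, each column label is a tuple rather than a string, which is why the .str accessor returns NaN for every label.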
>>> df
        bar                 baz                 foo                 qux
        one       two       one       two       one       two       one       two
0  0.713825  0.015553  0.036683  0.388443  0.729509  0.699883  0.125998  0.407517
1  0.820843  0.259039  0.217209  0.021479  0.845530  0.112166  0.219814  0.527205
2  0.734660  0.931206  0.651559  0.337565  0.422514  0.873403  0.979258  0.269594
3  0.314323  0.857317  0.222574  0.811631  0.313495  0.315072  0.354784  0.394564
4  0.672068  0.658103  0.402914  0.430545  0.879331  0.015605  0.086048  0.918678
Try:
>>> df.columns.to_flat_index().map(''.join)
Index(['barone', 'bartwo', 'bazone', 'baztwo',
'fooone', 'footwo', 'quxone', 'quxtwo'],
dtype='object')
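To get the exact MethodSum style the question asks for, one option is to join each tuple yourself and capitalize the second level. This is only a sketch: df_pred below is a made-up two-column frame standing in for the asker's data, and the capitalization rule is an assumption inferred from the single expected name given.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame with two-level columns.
df_pred = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("Method", "sum"), ("Method", "mean")]),
)

# to_flat_index() yields tuples such as ('Method', 'sum');
# build plain strings from them instead of using the .str accessor.
flat = df_pred.columns.to_flat_index()
df_pred.columns = [first + second.capitalize() for first, second in flat]

print(list(df_pred.columns))
```

If the plain concatenation from the answer above is enough, .map(''.join) is shorter; the list comprehension is only needed when each level has to be transformed before joining.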

Related

Efficiently labelling a column that contains repeated elements [duplicate]

This question already has answers here:
How to efficiently assign unique ID to individuals with multiple entries based on name in very large df
(3 answers)
Pandas: convert categories to numbers
(6 answers)
Convert pandas series from string to unique int ids [duplicate]
(2 answers)
Closed 1 year ago.
I have a dataframe with a column consisting of author names, where sometimes the name of an author repeats. My problem is: I want to assign a unique number to each author name in a corresponding parallel column (for simplicity, assume that this numbering follows the progression of whole numbers, starting with 0, then 1, 2, 3, and so on).
I can do this using nested for loops, but with 57000 records and roughly 500 unique authors, it takes far too long. Is there a quicker way to do this?
For example,
Original DataFrame contains:
**Author**
Name 1
Name 2
Name 1
Name 3
I want another column added next to it, such that:
**Author** **AuthorID**
Name 1 1
Name 2 2
Name 1 1
Name 3 3
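The linked duplicates point toward factorizing the column instead of looping. A minimal sketch, assuming the column is named Author as in the example and that IDs should start at 1 to match the table above (pd.factorize itself numbers from 0):

```python
import pandas as pd

df = pd.DataFrame({"Author": ["Name 1", "Name 2", "Name 1", "Name 3"]})

# factorize assigns 0, 1, 2, ... in order of first appearance;
# adding 1 is an assumption made to match the 1-based IDs in the example.
codes, uniques = pd.factorize(df["Author"])
df["AuthorID"] = codes + 1

print(df)
```

Because factorize works on the whole column at once, it should avoid the nested loops entirely; drop the + 1 if 0-based numbering is preferred.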

How to filter pandas dataframe based on length of a list in a column? [duplicate]

This question already has answers here:
How to filter a pandas dataframe based on the length of a entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How to filter this dataframe based on the length of the column subjects?
So for example, if I only want to have rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Also, using loc does not help, that gives me the same error.
Thanks in advance!
Use the string accessor to work with lists:
df[df['subjects'].str.len() >= 2]
Output:
id subjects
0 1 [math, history]
1 2 [English, Dutch, Physics]
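The original attempt fails because len(df["subjects"]) is the number of rows in the column, a single integer, so the comparison produces one boolean and df[True] is looked up as a column name. The sketch below rebuilds the sample frame (the construction is an assumption about how the data was created) and shows the per-row length with .str.len(), plus an equivalent using map(len).

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "subjects": [["math", "history"], ["English", "Dutch", "Physics"], ["Music"]],
})

# len() on the Series counts rows, not the items inside each list.
n_rows = len(df["subjects"])

# .str.len() computes the length of each element, giving a boolean mask per row.
filtered = df[df["subjects"].str.len() >= 2]

# Equivalent alternative: apply the built-in len to every element.
filtered_alt = df[df["subjects"].map(len) >= 2]

print(filtered)
```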

extract semicolon separated value from pandas df column [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I need to extract a specific value from pandas df column. The data looks like this:
row my_column
1 artid=delish.recipe.45064;artid=delish_recipe_45064;avb=83.3;role=4;data=list;prf=i
2 ab=px_d_1200;ab=2;ab=t_d_o_1000;artid=delish.recipe.23;artid=delish;role=1;pdf=true
3 dat=_o_1000;artid=delish.recipe.23;ar;role=56;passing=true;points001
The data is not consistent, but the fields are separated by semicolons, and I need to extract role=x.
I split the data on the semicolons and can loop through the values to fetch the roles, but I was wondering if there is a more elegant way to solve it.
Desired output:
row my_column
1 role=4
2 role=1
3 role=56
Thank you.
You can use str.extract and pass the required pattern within parentheses.
df['my_column'] = df['my_column'].str.extract(r'(role=\d+)', expand=False)
row my_column
0 1 role=4
1 2 role=1
2 3 role=56
This should work:
def get_role(x):
    # Split on ';' and return the first field that starts with 'role'
    fields = x.split(sep=';')
    return [f for f in fields if f[:4] == 'role'][0]
df['my_column'] = df['my_column'].map(get_role)
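For reference, here is a self-contained sketch of the str.extract approach run on the three sample rows from the question. The (?:^|;) prefix is an addition not present in the answers above; it is meant to ensure the match starts at the beginning of a field, so that a hypothetical key such as myrole=3 would not be picked up.

```python
import pandas as pd

df = pd.DataFrame({
    "row": [1, 2, 3],
    "my_column": [
        "artid=delish.recipe.45064;artid=delish_recipe_45064;avb=83.3;role=4;data=list;prf=i",
        "ab=px_d_1200;ab=2;ab=t_d_o_1000;artid=delish.recipe.23;artid=delish;role=1;pdf=true",
        "dat=_o_1000;artid=delish.recipe.23;ar;role=56;passing=true;points001",
    ],
})

# One capture group with expand=False returns a Series instead of a DataFrame.
roles = df["my_column"].str.extract(r"(?:^|;)(role=\d+)", expand=False)
df["my_column"] = roles

print(df)
```

Rows that contain no role field come back as NaN with this approach, whereas the list-indexing version above would raise an IndexError.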

Merging dataframes together in a for loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes, each frame contains timestamps and market caps corresponding to the timestamps, the keys of which are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key in the dictionary, 'merged', and use pd.merge to merge the 4 existing dataframes on their timestamp (I want complete rows, so an 'inner' join is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'],right=data2[coin], left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'. However, if data2['merged'] is not first set to data2['dogecoin'] (or some similar data), the merge won't work because 'merged' does not exist yet.
EDIT: my desired result is create one merged dataframe seen in a new element in dictionary 'data2' (data2['merged']), containing the merged data frames from the other elements in data2
Try calling merge() on the accumulated DataFrame itself instead of the general pd.merge(). You must seed 'merged' with the first DataFrame, then skip that one in the loop:
data2['merged'] = data2['dashcoin']
# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes, it's just about how to write a loop when the first element has to be treated differently to the rest.
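Another way to sidestep the special-cased first element is to fold the frames together with functools.reduce, which is not what either answer above uses. A minimal sketch follows; the toy frames and their values are invented for illustration, and only the coins list and the <coin>_cap column naming come from the question.

```python
from functools import reduce

import pandas as pd

coins = ["dashcoin", "litecoin", "dogecoin", "nxt"]
dates = pd.to_datetime(["2013-12-04", "2013-12-05", "2013-12-06"])

# Hypothetical toy data: one frame per coin with a timestamp and a cap column.
data2 = {}
for i, coin in enumerate(coins):
    data2[coin] = pd.DataFrame({"timestamp": dates, f"{coin}_cap": [i, i + 1, i + 2]})

# Drop the first day from one frame so the inner join has something to exclude.
data2["nxt"] = data2["nxt"].iloc[1:]

# reduce merges the first two frames, then merges each later frame into the result.
data2["merged"] = reduce(
    lambda left, right: pd.merge(left, right, on="timestamp", how="inner"),
    (data2[coin] for coin in coins),
)

print(data2["merged"])
```

Only timestamps present in every frame survive the inner join, so the merged frame here has two rows.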

How to remove ellipsis from a row in a Python Pandas series or data frame, shown when long lines/wide columns are truncated? [duplicate]

This question already has answers here:
How can I display full (non-truncated) dataframe information in HTML when converting from Pandas dataframe to HTML?
(10 answers)
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 7 months ago.
When I create the following Pandas Series:
pandas.Series(['a', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa']
I get this as a result:
0 a
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
3 aaaaaaaaaaaaaaaa
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
How can I instead get a Series without the ellipsis that looks like this:
0 a
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 aaaaaaaaaaaaaaaa
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
pandas is truncating the output for display only; the stored values are unchanged. You can raise the display limit:
In [4]:
data = pd.Series(['a', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'])
pd.set_option('display.max_colwidth',1000)
data
Out[4]:
0 a
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 aaaaaaaaaaaaaaaa
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
dtype: object
Also see related: Output data from all columns in a dataframe in pandas
By the way, if you are using IPython, then a docstring lookup (by pressing tab) shows the current values and the default values (the default is 50 characters).
For pandas versions older than 0.10, use
pd.set_printoptions(max_colwidth=1000)
See related: Python pandas, how to widen output display to see more columns?
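If the wider display is only needed for one print, pd.option_context can scope the setting so the global option is left alone. A small sketch, using a shortened version of the Series from the question; the explicit value of 50 simply mirrors the default mentioned above.

```python
import pandas as pd

data = pd.Series(["a", "a" * 70, "a" * 16])

# With the limit at 50, long strings are cut and shown with a trailing ellipsis.
with pd.option_context("display.max_colwidth", 50):
    truncated_repr = repr(data)

# With a larger limit, the full strings are shown.
with pd.option_context("display.max_colwidth", 1000):
    full_repr = repr(data)

print(truncated_repr)
print(full_repr)
```

Outside the with blocks the option returns to whatever it was before, which avoids surprising other code that prints DataFrames.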
