I have several dataframes in which some columns contain dates in this ASP.NET format: "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates it finds that match a regex, using pd.DataFrame.replace.
something like:
def pretty_dates(df):
    # messy logic here

df.replace(to_replace=r'/Date\((\d+)\)/', value=pretty_dates(df), regex=True)
The problem with this is that the df being passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the replacement value used by df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in my dataframe, over a hundred, that contain this date format. I would prefer not to list out every single column that has a date. Is there a way to apply the function that cleans my dates across all the columns in my dataset? I do not want to clean one column but all the hundreds of columns of my dataframe.
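For reference, here is a minimal sketch of the single-column conversion I already have working (the column name 'created' is just an illustration):
import pandas as pd
df = pd.DataFrame({'created': ['/Date(1239018869048)/']})
ms = df['created'].str.extract(r'/Date\((\d+)\)/')[0].astype('int64')
df['created'] = pd.to_datetime(ms, unit='ms')  # milliseconds since the Unix epoch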
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
import pandas as pd
df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
df = df.str.replace(r'/Date\(', '', regex=True)
df = df.str.replace(r'\)/', '', regex=True)
print(df)
0 1239018869048
1 1239018869048
dtype: object
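A one-step version of the same idea, using a capture group to keep only the digits, could look like this sketch:
df = df.str.replace(r'/Date\((\d+)\)/', r'\1', regex=True)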
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps you:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
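For the original date-cleaning problem, the same loop idea might look like the sketch below; it assumes date cells start with the /Date(...)/ token and sit in object-dtype columns:
import pandas as pd
def pretty_dates(df):
    pattern = r'/Date\((\d+)\)/'
    for col in df.columns:
        if df[col].dtype == object:  # only attempt string-like columns
            matched = df[col].str.match(pattern, na=False)
            if matched.any():
                ms = df.loc[matched, col].str.extract(pattern)[0].astype('int64')
                df.loc[matched, col] = pd.to_datetime(ms, unit='ms')
    return df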
I have a column of tuples from which I would like to remove the brackets.
Example
words
(hello,me)
(what,can)
(ring, dog)
I have tried this:
df['words'].agg(','.join)
Unfortunately I receive the error in the title.
I would like this output:
words
hello,me
what,can
ring, dog
Any solution?
Also, strangely enough, with a different dataset that line of code works. Any ideas why?
I think you can use df.apply to update the words column, applying a function that modifies the value of each row:
import pandas as pd
df = pd.DataFrame({'words': [('hello','me'), ('what','can')]})
df['words'] = df.apply(lambda row: ','.join(row['words']), axis=1)
Edit: come to think of it, your original approach using df['words'].agg should also work, but you need to assign the result back to the words column for it to change the dataframe:
import pandas as pd
df = pd.DataFrame({'words': [('hello','me'), ('what','can')]})
df['words'] = df['words'].agg(','.join)
print(df)
I am looking to select all values that include "Hennessy" in the name, e.g. "Hennessy Black Cognac", "Hennessy XO". I know it would simply be
trial = Sales[Sales["Description"] == "Hennessy"]
if I wanted only the exact value "Hennessy", but I want it if it contains the word "Hennessy" at all.
I'm working in Python with pandas imported.
Thanks :)
You can use the in keyword to check if a substring is present in a string. Since each cell of the column needs its own check, apply it elementwise, like this:
trial = Sales[Sales["Description"].apply(lambda s: "hennessy" in s.lower())]
You can try using str.startswith:
import pandas as pd
# initialize list of lists
data = [['Hennessy Black Cognac', 10], ['Hennessy XO', 15], ['julian merger', 14]]
# create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
new_df = df.loc[df.Name.str.startswith('Hennessy', na=False)]
new_df
Or you can use apply to easily apply any string-matching function to your column elementwise:
df_new = df[df['Name'].apply(lambda x: x.startswith('Hennessy'))]
df_new
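That said, since you want a contains-anywhere match rather than a prefix match, the vectorized str.contains is probably closer to what you asked for (a sketch, matching case-insensitively):
trial = Sales[Sales["Description"].str.contains("hennessy", case=False, na=False)]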
I want to split the rows while maintaining the values.
How can I split the rows like that?
The data frame below is an example.
The output that I want to see:
You can use pd.melt(). Read the documentation for more information: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
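A minimal sketch of what that could look like (the column names are placeholders, since your frame is only shown as an image):
import pandas as pd
df = pd.DataFrame({'value': ['a', 'b'], 'ID1': [1, 2], 'ID2': [3, 4]})
long_df = df.melt(id_vars='value', value_name='ID').drop(columns='variable')
print(long_df)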
I tried working on your problem.
import pandas as pd
melted_df = data.melt(id_vars=['value'], var_name="ToBeDropped", value_name="ID1")
This may show a warning because of the ambiguity of the string passed for the value_name argument. It also creates a new column holding the old column headers, which I have already named: the new column will be called 'ToBeDropped'. The code below will remove that column for you:
df = melted_df.drop(columns = ['ToBeDropped'])
'df' will be your desired output.
via wide_to_long:
df = pd.wide_to_long(df, stubnames='ID', i='value',
                     j='ID_number').reset_index(0)
via set_index and stack:
df = df.set_index('value').stack().reset_index(name='IDs').drop(columns='level_1')
via melt:
df = df.melt(id_vars='value', value_name="ID1").drop(columns='variable')
I have an Excel file that I import into a dataframe. I want to extract the contents of one column into several columns.
Here is the original:
After importing into pandas in Python, I get this data with '\n':
So, I want to extract the contents of that column. Could you all share an idea or code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np
data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    x[2] = x[2].removeprefix('Vector:')  # removeprefix (Python 3.9+) strips the literal prefix; lstrip('Vector:') would strip a character set
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First you split each element of the Details column into a list of strings. Second you handle the 'Vector:....' special case and filter out the label tokens. Third you store all the values in a flat list, which is in turn converted to a NumPy array with shape (length, 3). Finally you drop the old 'Details' column and concatenate with the df created from the split strings.
You may want to try a more efficient way to transform your data while reading it, by using these ideas inside the pd.read_excel method via converters.
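A sketch of that idea, assuming the same token layout as above (the file name and 'Details' column come from the example; the parse_details helper is hypothetical):
import pandas as pd
def parse_details(cell):
    # split "Type: A Vector:B Mission: C" into its three values
    x = str(cell).split()
    x[2] = x[2].removeprefix('Vector:')
    return [v for v in x if v not in ('Type:', 'Mission:')]
data = pd.read_excel("the_data.xlsx", converters={'Details': parse_details})
data[['Type', 'Vector', 'Mission']] = pd.DataFrame(data['Details'].tolist(), index=data.index)
data = data.drop(columns='Details')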
I'm trying to manipulate a dataframe using a cumsum function.
My data looks like this:
To perform my cumsum, I use
df = pd.read_excel(excel_sheet, sheet_name='Sheet1').drop(columns=['Material']) # Dropping material column
I run the rest of my code, and get my expected outcome of a dataframe cumsum without the material listed:
df2 = df.to_numpy()  # converting to array format; as_matrix() was removed in pandas 1.0
new = df2.cumsum(axis=1)
print(new)
However, at the end, I need to put this Material column back. I'm unsure how to add it back at the beginning of the dataframe.
IIUC, then you can just set the material column to the index, then do your cumsum, and put it back in at the end:
df2 = df.set_index('Material').cumsum(axis=1).reset_index()
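For example, with made-up data:
import pandas as pd
df = pd.DataFrame({'Material': ['A', 'B'], 'Jan': [1, 2], 'Feb': [3, 4]})
print(df.set_index('Material').cumsum(axis=1).reset_index())
#   Material  Jan  Feb
# 0        A    1    4
# 1        B    2    6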
An alternative would be to do your cumsum on all but the first column:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)