I want to split the values of the columns "words" and "frequency" into multiple rows of the dataframe df.
[1]: Problem https://i.stack.imgur.com/7i1p6.png
I use the following piece of code to manipulate the data:
df = (df.set_index(["document"]).apply(lambda x: x.str.split(",").explode()).reset_index())
The problem I have identified is that the values in the "words" and "frequency" columns are wrapped in brackets, e.g. (word1, word2, word3, wordn). After the code runs, the output is NaN.
The following solution is sought:
[2]: Solution: https://i.stack.imgur.com/XQqo1.png
You were close! The problem is likely in how the indices are reset. For a csv file looking like:
"document","words","frequency"
"document 1","(cat,dog,bird)","(12,34,354)"
"document 2","(berlin,new_york,paris)","(1,13,254)"
import pandas as pd
df = pd.read_csv(csv_file)
# Remove the surrounding brackets first; regex=False because "(" and ")"
# are regex metacharacters
df2 = df.apply(lambda x: x.str.replace("(", "", regex=False))
df3 = df2.apply(lambda x: x.str.replace(")", "", regex=False))
# Keep "document" as the index so it stays aligned with the exploded rows
df4 = (df3.set_index("document")
          .apply(lambda x: x.str.split(",").explode())
          .reset_index())
print(df4)
You can probably also write this as a single chained expression instead of separate steps.
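A sketch of the same cleanup as one chain, using io.StringIO as a stand-in for the csv file above:

```python
import io
import pandas as pd

# Inline stand-in for the csv file shown above
csv_file = io.StringIO(
    '"document","words","frequency"\n'
    '"document 1","(cat,dog,bird)","(12,34,354)"\n'
    '"document 2","(berlin,new_york,paris)","(1,13,254)"\n'
)

df = pd.read_csv(csv_file)
# Strip the surrounding brackets, split on commas, and explode in one pass
out = (df.set_index("document")
         .apply(lambda x: x.str.strip("()").str.split(",").explode())
         .reset_index())
print(out)
```

Note the exploded values stay as strings; cast "frequency" with astype(int) if you need numbers.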
I'm working on a project where I would like to use two lambda functions to find a match in another column. I created a dummy df with the following code:
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
Now I would like to find column A matches in column B.
df['match'] = df.apply(lambda x: x['B'].find(x['A']), axis=1).ge(0)
Now I would like to add an extra check where I'm also checking if column C values appear in column D:
df['match'] = df.apply(lambda x: x['D'].find(x['C']), axis=1).ge(0)
I'm looking for a way to combine these two lines into a one-liner, for example with something like an '&' operator. Any help is appreciated.
You can combine both checks with the and operator inside a single lambda. The lambda already returns a boolean, so the trailing .ge(0) is not needed:
df['match'] = df.apply(lambda x: (x['B'].find(x['A']) >= 0) and (x['D'].find(x['C']) >= 0), axis=1)
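A sketch with a small string-valued frame (hypothetical data; the random-integer dummy frame would first need str() casts, since ints have no .find method):

```python
import pandas as pd

# Made-up string data; column names follow the question
df = pd.DataFrame({
    "A": ["cat", "dog"],
    "B": ["the cat sat", "no dogs here"],
    "C": ["1", "9"],
    "D": ["12", "34"],
})

# Both containment checks combined in one lambda; the result is already
# boolean, so no .ge(0) afterwards
df["match"] = df.apply(
    lambda x: x["B"].find(x["A"]) >= 0 and x["D"].find(x["C"]) >= 0,
    axis=1,
)
print(df["match"].tolist())  # [True, False]
```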
I'm trying to clean a file using pandas chaining, and I've reached a point where I only need to clean one column and leave the rest as is. Is there a way to accomplish this with pandas chaining, using apply or pipe?
I have tried the following, which works, but I would like to replace the dash in only one specific column and leave the rest as is, since the dash in the other columns is appropriate.
df = (dataFrame
.dropna()
.pipe(lambda x: x.replace("-", "", regex=True))
)
I have also tried this, which doesn't work since it only returns the seatnumber column.
df = (dataFrame
.dropna()
.pipe(lambda x: x['seatnumber'].replace("-", "", regex=True))
)
Thanks in advance.
One way is to assign a column with the same name of the column of interest:
df = (dataFrame
.dropna()
.assign(seatnumber=lambda x: x.seatnumber.replace("-", "", regex=True))
)
where the dataframe at that point is passed to the lambda as x.
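A minimal sketch of this pattern, with hypothetical column data made up for illustration:

```python
import pandas as pd

# "flight" legitimately contains dashes; only "seatnumber" should lose them
dataFrame = pd.DataFrame({
    "flight": ["LH-440", "BA-117"],
    "seatnumber": ["12-A", "3-C"],
})

df = (dataFrame
      .dropna()
      .assign(seatnumber=lambda x: x.seatnumber.replace("-", "", regex=True))
)
print(df)
```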
Let us pass a dict
df = (dataFrame
.dropna().replace({"seatnumber" : {"-":""}}, regex=True)
)
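A sketch of the nested-dict form on the same kind of hypothetical frame: the outer key selects the column, the inner dict maps pattern to replacement, so dashes in other columns survive:

```python
import pandas as pd

dataFrame = pd.DataFrame({
    "flight": ["LH-440", "BA-117"],   # dashes here should survive
    "seatnumber": ["12-A", "3-C"],    # dashes here should go
})

df = (dataFrame
      .dropna()
      .replace({"seatnumber": {"-": ""}}, regex=True)
)
print(df)
```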
Working with a CSV file in PyCharm. I want to delete the automatically-generated index column. When I print it, however, the answer I get in the terminal is "None". All the answers by other users indicate that the reset_index method should work.
If I just say "df = df.reset_index(drop=True)" it does not delete the column, either.
import pandas as pd
df = pd.read_csv("music.csv")
df['id'] = df.index + 1
cols = list(df.columns.values)
df = df[[cols[-1]]+cols[:3]]
df = df.reset_index(drop=True, inplace=True)
print(df)
I agree with @It_is_Chris. Also, this does not work, because with inplace=True the return value is None:
df = df.reset_index(drop=True, inplace=True)
It should be one of these instead:
df.reset_index(drop=True, inplace=True)
or
df = df.reset_index(drop=True)
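A short sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"title": ["a", "b"]}, index=[5, 9])

# inplace=True mutates df in place and returns nothing
result = df.reset_index(drop=True, inplace=True)
print(result)             # None
print(df.index.tolist())  # [0, 1]
```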
Since you said you're trying to "delete the automatically-generated index column" I could think of two solutions!
First solution:
Use one of your dataset's columns as the index. If your dataset already has an index/numbering column, you can do something like this:
#assuming your first column in the dataset is your index column which has the index number of zero
df = pd.read_csv("yourfile.csv", index_col=0)
#you won't see the automatically-generated index column anymore
df.head()
Second solution:
You could delete it in the final csv:
#To export your df to a csv without the automatically-generated index column
df.to_csv("yourfile.csv", index=False)
This is the input dataframe I have.
And this is the output that I want:
As you can see, the two dataframes are merged on the column Key1 so that rows sharing a key have their values joined by commas.
I have tried using merge but it doesn't give the correct output.
mer = pd.merge(df,df, on='Key1', how='inner')
Is there a specific way to approach this?
You can convert the values to strings and join the unique values with commas in a custom lambda function.
Solution for older pandas versions, working around missing values in Key1 by replacing them with a temporary value:
import numpy as np
import pandas as pd

df1 = (df.fillna({'Key1': 'missing'})
       .groupby('Key1')
       .agg(lambda x: ','.join(pd.unique(x.astype(str))))
       .reset_index()
       .replace({'Key1': {'missing': np.nan}}))
Solution for recent pandas versions:
df1 = (df.groupby('Key1')
.agg(lambda x: ','.join(pd.unique(x.astype(str))))
.reset_index())
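A runnable sketch with made-up data (the "val" column is hypothetical; only Key1 comes from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Key1": ["k1", "k1", "k2"],
    "val": [1, 2, 3],
})

# Join the unique stringified values of each group with commas
df1 = (df.groupby("Key1")
         .agg(lambda x: ",".join(pd.unique(x.astype(str))))
         .reset_index())
print(df1)  # k1 -> "1,2", k2 -> "3"
```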
I have been checking each value of each row and if all of them are null, I delete the row with something like this:
df = pandas.concat([df[:2], df[3:]])
But I am thinking there's got to be a better way to do this. I have been trying to use a mask, doing something like this:
rows_to_keep = df.apply(
    lambda row:
    any(val is not None for val in row),
    axis=1)
I also tried something like this (suggested on another stack overflow question)
pandas.DataFrame.dropna()
but don't see any differences in my printed dataframe.
dropna returns a new DataFrame, you probably just want:
df = df.dropna()
or
df.dropna(inplace=True)
If you have a more complicated mask, rows_to_keep, you can do:
df = df[rows_to_keep]
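Since the stated goal was to delete a row only when all of its values are null, note that dropna defaults to dropping rows with any null; how="all" matches the original intent. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 3.0, np.nan],
})

# how="all" drops only rows where every value is null (here: the last row)
df = df.dropna(how="all")
print(df)
```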