Suppose I have the following table:
[screenshot of the table omitted; it has Person and Age columns]
I am currently using this line to create a new column:
MyDataFrame['PersonAge'] = MyDataFrame.apply(lambda row: "({},{})".format(row['Person'],['Age']), axis=1)
My goal is to have a column consisting of something like: (John, 24.0)
After running that line, when I call MyDataFrame.head(), this is the last column I see: (John, ['Age'])
This is true for all rows; for example, in the next row I have: (Myla, ['Age'])
Any idea what could be the issue? I copied the column name from my table hoping it was a typo of some sort, but I got the same result.
I would appreciate any help (or a new way to make a "pair" of the previous data)! :)
It seems like I forgot to put row before ['Age'], so format was being handed the literal list ['Age'] instead of the row's Age value.
The answer should be:
MyDataFrame['PersonAge'] = MyDataFrame.apply(lambda row: "({},{})".format(row['Person'],row['Age']), axis=1)
I checked the code and this one works great! :)
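For anyone who wants the "pair" without a row-wise apply, here is a minimal alternative sketch (column names Person and Age as in the question):
# build the string pairs by zipping the two columns directly
MyDataFrame['PersonAge'] = [
    "({},{})".format(person, age)
    for person, age in zip(MyDataFrame['Person'], MyDataFrame['Age'])
]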
I have a large dataframe and am trying to add a leading (far left, 0th position) column for descriptive purposes. The dataframe and the column I'm trying to insert have the same number of rows.
The column I'm inserting looks like this:
Description 1
Description 2
Description 3
.
.
.
Description n
The code I'm using to attach the column is:
df.insert(loc=0, column='description', value=columnToInsert)
The code I'm using to write to file is:
df.to_csv('output', sep='\t', header=None, index=None)
(Note: I've written to file with and without the "header=None" option, doesn't change my problem)
Now after writing to file, what I end up getting is:
Description 2 E11 ... E1n
Description 3 E21 ... E2n
.
.
.
Description n E(n-1)1... E(n-1)n
NaN En1 ... Enn
So the first element of my descriptive, leading column is deleted, all the descriptions are off by one, and the last row has "not a number" as its description.
I have no idea what I'm doing which might cause this, and I'm not really sure where to start in correcting it.
Figured it out. The issue stemmed from the fact that I had deleted a row from my large dataframe prior to inserting my descriptive column; insert aligns on index labels, so the gap left by the deleted row made the indices line up improperly.
So now I included the line:
df.reset_index(drop=True, inplace=True)
Everything lines up properly now and no elements are deleted!
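A minimal sketch of the pitfall and the fix, with made-up data (pandas aligns on index labels when inserting):
import pandas as pd

df = pd.DataFrame({'E': [10, 20, 30, 40]})
df = df.drop(1)                          # index now 0, 2, 3 -- a gap
col = pd.Series(['Description 1', 'Description 2', 'Description 3'])  # index 0, 1, 2

# inserting now would align on labels: label 1 would be skipped and label 3 would get NaN
df.reset_index(drop=True, inplace=True)  # index back to 0, 1, 2
df.insert(loc=0, column='description', value=col)  # now lines up one-to-one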
I have the following code to generate a .csv File:
sfdc_dataframe.to_csv('sfdc_data_demo.csv',index=False,header=True)
It is just one column. How could I get the last value of the column and delete the trailing comma in that value?
Example image of input data:
https://i.stack.imgur.com/M5nVO.png
And the result that I'm trying to make:
https://i.stack.imgur.com/fEOXM.png
Anyone have an idea or tip?
Thanks!
After reading the CSV file into a dataframe (the logic you shared), you can use the logic below, which replaces the value specifically in the last row of your column.
sfdc_dataframe['your_column_name'].iat[-1] = sfdc_dataframe['your_column_name'].iat[-1][:-1]  # .iat[-1] is a scalar string, so slice it directly (no .str accessor)
Updated answer below, since only the last row's value needs to change.
val = sfdc_dataframe.iloc[-1, sfdc_dataframe.columns.get_loc('col')]
sfdc_dataframe.iloc[-1, sfdc_dataframe.columns.get_loc('col')] = val[:-1]
Easy way
import pandas as pd

df = pd.read_csv("df_name.csv", dtype=object)
df['column_name'] = df['column_name'].str[:-1]  # careful: this trims the last character from every row, not just the last one
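Putting it together, a minimal end-to-end sketch that trims only the last row's trailing comma before writing; the data and the column name 'col' are made up for illustration:
import pandas as pd

sfdc_dataframe = pd.DataFrame({'col': ['a@x.com,', 'b@x.com,', 'c@x.com,']})

pos = sfdc_dataframe.columns.get_loc('col')
last = sfdc_dataframe.iloc[-1, pos]
if last.endswith(','):                   # guard so re-running doesn't over-trim
    sfdc_dataframe.iloc[-1, pos] = last[:-1]

sfdc_dataframe.to_csv('sfdc_data_demo.csv', index=False, header=True)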
Hello everyone, I need some help with the question below.
My dataframe looks like the one below, and I am using PySpark.
The time column needs to be split into two columns, 'start time' and 'end time', like below.
I tried a couple of methods, like self-joining the df on m_id, but they look very tedious and inefficient. I would appreciate it if someone could help me with this.
Thanks in advance
Performing something based on row order in Spark is not a good idea. Row order is preserved while reading the file, but it may get shuffled between transformations, and then there is no way to know which row was the previous one (the start time). You would need to ensure that no shuffling happens in order to avoid this, but that introduces other complexities.
My suggestion is to work on the file at the source level and add a row-number column, like:
r_n m_id time
0 2 2022-01-01T12:12:12.789+000
1 2 2022-01-01T12:14:12.789+000
2 2 2022-01-01T12:16:12.789+000
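If the source file happens to be produced with pandas, one way to get that r_n column is simply to write the index out (an assumption about your pipeline):
df.to_csv('source.csv', index=True, index_label='r_n')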
Later, in Spark, you do a left self-join on r_n, like:
from pyspark.sql.functions import col

# shift the row number back by one so each row picks up the next row's time as its end time
df_next = df.select((col("r_n") - 1).alias("r_n"), col("time").alias("end_time"))
df_final = df.join(df_next, on="r_n", how="left").select("m_id", col("time").alias("start_time"), "end_time")
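To sanity-check the join on a toy dataset (hypothetical values; assumes an active spark session):
data = [(0, 2, "2022-01-01T12:12:12"), (1, 2, "2022-01-01T12:14:12"), (2, 2, "2022-01-01T12:16:12")]
df = spark.createDataFrame(data, ["r_n", "m_id", "time"])
# running the two lines above then pairs each row's time (start_time) with the
# next row's time (end_time); the last row's end_time is null, as there is no following row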
I'm trying to learn more about re-indexing.
For background context, I have a data frame called sleep_cycle.
In this data frame, the columns are: Name, Age, Average sleep time.
I want to pick out only those whose names begin with the letter 'B'.
I then want to re-index these 'B' people, so that I have a new data frame with the same columns but only the people whose names begin with 'B'.
Here was my attempt to do it:
info = list(sleep_cycle.columns) #this is just to set a list of the existing columns
b_names = [name for name in sleep_cycle['name'] if name[0] == 'B']
b_sleep_cycle = sleep_cycle.reindex(b_names, columns = info) #re-index with the 'B' people, and set columns to the list I saved earlier.
Results: re-indexing was successful, I managed to pick out only those whose names began with 'B', and the columns remained the same. Great! Problem was: all the data had been replaced with NaN.
Can someone help me with this one? What am I doing wrong? It would be best appreciated if you could suggest a solution that is only in one line of code.
Based on your description (example data and expected output would be better), this would work. reindex looks the labels you pass it up in the index, and since the names are column values rather than index labels, every lookup missed and produced NaN; a plain boolean filter avoids that:
sleep_cycle[sleep_cycle['name'].str.startswith('B')]
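If you also want a fresh 0..n-1 index on the filtered frame, a one-liner sketch (reusing the column name from your code):
b_people = sleep_cycle[sleep_cycle['name'].str.startswith('B')].reset_index(drop=True)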
I am not very used to programming and need some help solving a problem.
I have a .csv with 4 columns and about 5k rows, filled with questions and answers.
I want to find word collocations in each cell.
Starting point: Pandas dataframe with 4 columns and about 5k rows. (Id, Title, Body, Body2)
Goal: a dataframe with 7 columns (Id, Title, Title-Collocations, Body, Body-Collocations, Body2, Body2-Collocations), with a function applied to each of its rows.
I have found an example of bigram collocations in the NLTK documentation:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 5))
>>>[('Beer', 'Lahai'), ('Lahai', 'Roi'), ('gray', 'hairs'), ('Most', 'High'), ('ewe', 'lambs')]
I want to adapt this function to my Pandas dataframe. I am aware of the apply function for Pandas dataframes, but can't manage to get it to work.
This is my test-approach for one of the columns:
df['Body-Collocation'] = df.apply(lambda df: BigramCollocationFinder.from_words(df['Body']),axis=1)
but if I print that out for an example row, I get
print (df['Body-Collocation'][1])
>>> <nltk.collocations.BigramCollocationFinder object at 0x113c47ef0>
I am not even sure if this is the right way. Can someone point me in the right direction?
If you want to apply BigramCollocationFinder.from_words() to each value in the Body column, you'd have to do:
df['Body-Collocation'] = df.Body.apply(lambda x: BigramCollocationFinder.from_words(x))
In essence, apply allows you to loop through the rows and provide the corresponding value of the Body column to the applied function.
But as suggested in the comments, providing a data sample would make it easier to address your specific case.
Thanks for the answer. I guess the question I asked was not perfectly phrased, but your answer still helped me find a solution. Sometimes it's good to take a short break :-)
In case someone is interested in the answer, this worked for me:
df['Body-Collocation'] = df.apply(lambda row: BigramCollocationFinder.from_words(row['Question-Tok']), axis=1)
df['Body-Collocation'] = df['Body-Collocation'].apply(lambda finder: finder.nbest(bigram_measures.pmi, 3))
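For what it's worth, the two passes can also be collapsed into one (a sketch, assuming 'Question-Tok' holds token lists and bigram_measures is defined as above):
df['Body-Collocation'] = df['Question-Tok'].apply(
    lambda tokens: BigramCollocationFinder.from_words(tokens).nbest(bigram_measures.pmi, 3)
)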