Concatenating the entire row in Pandas - python

I'm new to Pandas. I want to join each entire row of strings into a single paragraph, so that every row ends up with one combined paragraph instead of a number of separate columns. For example, if a DataFrame has 10 columns and 5 rows, I want the output to be 5 rows with a single column that combines the string data from all 10 columns. I need this so I can apply Bag of Words / TF-IDF for sentiment analysis. Need help to do it.

A solution I thought of for a single row (the first row, say) might be
' '.join(str(x) for x in df.iloc[0, 0:10])
If there is a better solution, it will be more helpful for me.
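If you want this for every row at once, a rough vectorized sketch (the toy DataFrame and the 'combined' column name are just placeholders for your own data) could look like this:

import pandas as pd

# Toy frame standing in for the 5x10 DataFrame from the question.
df = pd.DataFrame({'col_a': ['good', 'bad'], 'col_b': ['movie', 'plot']})

# Cast every cell to string and join each row's values with a space,
# producing one paragraph per row in a single new column.
df['combined'] = df.astype(str).apply(' '.join, axis=1)
print(df['combined'].tolist())
# ['good movie', 'bad plot']

The resulting single column can then be fed into a bag-of-words or TF-IDF vectorizer.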

Related

Slicing Pandas DataFrame every nth row

I have a CSV file, which I read in as a pandas DataFrame. I want to slice the data after every nth row and store each slice as an individual CSV.
My data looks somewhat like this:
index,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z
0,-2.53406373,6.92596131,4.499464420000001,-2.8623820449999995,7.850541115,5.129520459999999
1,-2.3442032099999994,6.878311170000001,5.546690349999999,-2.6456542850000004,7.58022081,5.62603916
2,-1.8804458600000005,6.775125179999999,6.566146829999999,-2.336306185,7.321197125,6.088656729999999
3,-1.7059021099999998,6.650866649999999,7.07060242,-2.1012737650000006,7.1111130000000005,6.416324900000001
4,-1.6802886999999995,6.699703990000001,7.15823367,-1.938001715,6.976289879999999,6.613534820000001
5,-1.6156433,6.871610960000001,7.13333286,-1.81060772,6.901037819999999,6.72789553
6,-1.67286072,7.005918899999999,7.22047422,-1.722352455,6.848503825,6.8044359100000005
7,-1.56608278,7.136883599999999,7.150566069999999,-1.647941205,6.821055315,6.850329440000001
8,-1.3831649899999998,7.2735946999999985,6.88074028,-1.578703155,6.821634375,6.866061665000001
9,-1.25986478,7.379898050000001,6.590330490000001,-1.5190086099999998,6.839881785,6.861375744999999
10,-1.1101097050000002,7.48500525,6.287461959999999,-1.4641099750000002,6.870566649999999,6.842625039999999
For example, I would like to slice it after every 5th row and store the rows indexed 0-4 and 5-9 each in a single CSV (so in this case I would get 2 new CSVs); row 10 should be discarded.
One issue is that I'll have to apply this to multiple files which differ in length as well as naming the newly created CSVs.
You can do it with a for loop:
for i in range(len(df) // 5):  # integer division keeps only the complete chunks of 5
    df.iloc[i*5:(i+1)*5, :].to_csv('Stored_files_' + str(i) + '.csv')
So in the first iteration rows 0 to 4 are stored as "Stored_files_0.csv",
in the second iteration rows 5 to 9 as "Stored_files_1.csv",
and so on. Any leftover rows at the end (fewer than 5) are simply not written out.
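Since the question also mentions applying this to multiple files of differing lengths, a rough sketch of that (the data/*.csv pattern, chunk size, and output naming are assumptions) might be:

import glob
import os
import pandas as pd

chunk_size = 5
for path in glob.glob('data/*.csv'):  # assumed location of the input files
    df = pd.read_csv(path)
    base = os.path.splitext(os.path.basename(path))[0]
    # Write out each complete chunk of 5 rows; a short trailing chunk is dropped.
    for i in range(len(df) // chunk_size):
        chunk = df.iloc[i * chunk_size:(i + 1) * chunk_size]
        chunk.to_csv(base + '_part_' + str(i) + '.csv', index=False)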

pandas - single column to multiple columns (getting key length error)

I'm having trouble working out how to go from a single column to multiple columns using pandas. I have a main column that could have up to ten words separated by commas. I only have eight columns to split these words into (no more).
I'm currently using the code below to split the words into multiple columns. This code works as long as I know exactly how many words are in the longest cell. Example: in the case below, one of the cells in the original file must have exactly eight words for this to work properly. Otherwise, I get an error (Columns must be same length as key). In testing, I have found that the number of target columns must exactly match the number of words in the longest cell. No more, no less.
df[['column1','column2','column3','column4','column5','column6','column7','column8']] = \
    df['main'].str.split(',', expand=True)
What I'd like to see happen is a way to not have to worry about how many words are in the cells of the main column. If the longest cell contains 6 words, split them out into 6 columns. If the longest cell contains 8 words, split them out into 8 columns. If the longest cell contains 10 words, drop the last two words and split the rest out using 8 columns.
About the original file's main column: I will not know how many words exist in each of the cells. I just have 8 columns, so the first eight words (if there are that many) each get the honor of a column of their own. The rest of the words (if any) get dropped.
Question: why do I need to have the exact number of columns in the code above if the word count of my longest cell doesn't exceed the number of columns? I'm not understanding something.
Any help with the logic would be appreciated.
import numpy as np

cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
df2 = df['main'].str.split(',', expand=True, n=8)
# df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
# -------
if 0 in df2.columns:
    df['column1'] = np.where(df2[0].isnull(), df['column1'], df2[0])
You can use n=8 and then split the last column
df2 = df['main'].str.split(',', expand=True, n=8)
df = df.assign(**df2.set_axis(df.columns[:df2.shape[1]], axis=1))
df['column8'] = df['column8'].str.split(',').str[0]
You can use a list of labels instead of df.columns if you don't want to save the result in the first df2.shape[1] columns of df.
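Another way to avoid the key-length error entirely, no matter how many words the longest cell holds, is to cap the split and pad to a fixed set of columns. This is only a sketch; the 'main' column and the column1..column8 targets are taken from the question, and the toy data is made up:

import pandas as pd

df = pd.DataFrame({'main': ['a,b,c', 'a,b,c,d,e,f,g,h,i,j']})  # toy data
target_cols = ['column1', 'column2', 'column3', 'column4',
               'column5', 'column6', 'column7', 'column8']

parts = df['main'].str.split(',', expand=True, n=7)  # at most 8 pieces per row
if 7 in parts.columns:
    # The 8th piece holds any overflow (words 9, 10, ...); keep only its first word.
    parts[7] = parts[7].str.split(',').str[0]
parts = parts.reindex(columns=range(8))  # pad to exactly 8 columns (NaN where short)
df[target_cols] = parts.to_numpy()

Because the right-hand side always has exactly 8 columns, the assignment never complains that the columns must be the same length as the key.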

Pandas - Read only first few lines of each row

I have a large CSV file with about 10000 rows of text information, and each row of my dataset consists of a number of lines. However, I just want to read, say, the first 20 lines of each of the rows of my CSV file.
I came across the nrows parameter in the pandas read_csv method, which helps limit the number of rows of the dataset to load. Is there also a way to read only the first 20 lines of data from each of the rows in pandas?
You can read in the csv with df = pd.read_csv('path\file.csv') and then just select the first 20 rows with df_new = df.head(20). Is that what you were thinking of?
If I get your question correctly, your CSV file has multiple rows, where each row has multiple lines separated by the newline character '\n'. And you want to choose the first (say, for example) 3 lines from each row.
This can be achieved as:
import pandas as pd

# Read in CSV file using pandas-
data = pd.read_csv("example.csv")
# The first two rows (toy example) of dataset are-
data.iloc[0,0]
# 'Hello, this is first line\nAnd this is the second line\nThird and final line'
data.iloc[1,0]
# 'Today is 3rd June\nThe year is 2020\nSummer weather'
# First row's first line-
data.iloc[0,0].split("\n")[0]
# 'Hello, this is first line'
# First row's first two lines-
data.iloc[0,0].split("\n")[0:2]
# ['Hello, this is first line', 'And this is the second line']
The general syntax to get the first 'n' lines from row 'x' (assuming that the first column has the string data) is:
data.iloc[x,0].split("\n")[:n]
To select the first 'm' lines (assuming there are m lines or more) from the first 'x' rows, use the code:
data.iloc[:x, 0].apply(lambda y: y.split("\n")[0:m])
Does this help?
If TiTo's answer is not what you are looking for, maybe the iloc method is. You can store the first 20 rows by doing firstRows = df.iloc[:20].
However, if you only ever need the first 20 rows, you shouldn't load the whole file into memory. As you mentioned, this can be achieved with the nrows parameter.
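Building on the split idea above, here is a short sketch that keeps only the first 20 lines of every row and joins them back into a single string, assuming the text lives in a column called 'text':

import pandas as pd

data = pd.read_csv("example.csv")

# Split each cell on newlines, keep the first 20 lines, and join them back
# into one string so the column stays text rather than becoming lists.
data['text'] = data['text'].str.split('\n').str[:20].str.join('\n')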

How to read a dataset into pandas and leave out rows with uneven column count

I am trying to read a dataset which has a few rows with an uneven column count ('ragged'). I want to leave out those rows and read the rest. Is this possible in pandas without breaking the dataset into separate data frames and combining them?
If I understand your question, you have uneven columns but want to drop any rows that don't have every column. If so, simply read the entire data set (read_csv) and then call dropna() on the dataframe. dropna() has a keyword argument called 'how' which defaults to 'any', that is, a row (or column) is dropped if any of its items are NA. (Consider also passing inplace=True.) See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
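As a rough sketch of both sides of the problem (the file name is made up, and on_bad_lines needs pandas 1.3 or newer; older versions use error_bad_lines=False instead):

import pandas as pd

# Rows with more fields than the header are skipped while parsing;
# rows with fewer fields come through padded with NaN, so dropna() removes them.
df = pd.read_csv('ragged.csv', on_bad_lines='skip')
df = df.dropna(how='any')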

combine the values of a specific column of a dataframe in one row or unit

I want to combine the different values/rows of a certain column. These values are texts, and I want to combine them so I can perform a word count and find the most common words.
The dataframe is called df and is made of 30 columns. I want to combine all the rows of the first column (labeled 'text') into one row, or one list, etc.; it doesn't matter, as long as I can run FreqDist on it. I am not interested in grouping the values according to a certain value, I just want all the values in this column to become one block.
I looked around a lot and couldn't find what I am looking for.
Thanks a lot.
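A minimal sketch of one way to do this (assuming FreqDist refers to NLTK's FreqDist and that the 'text' column holds plain strings):

import pandas as pd
from nltk import FreqDist

df = pd.DataFrame({'text': ['good movie', 'good plot', 'bad acting']})  # toy data

# Join every row of the 'text' column into one block of text,
# then count word frequencies on a simple whitespace tokenization.
block = ' '.join(df['text'].astype(str))
freq = FreqDist(block.lower().split())
print(freq.most_common(3))
# [('good', 2), ('movie', 1), ('plot', 1)]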
