pandas - single column to multiple columns (getting key length error) - python

I'm having trouble working out how to split a single column into multiple columns using pandas. I have a main column that could have up to ten words separated by commas. I only have eight columns to split these words into (no more).
I'm currently using the code below to split the words into multiple columns. This code works as long as I know exactly how many words are in the longest cell. Example: in the case below, one of the cells in the original file must have exactly eight words for this to work properly. Otherwise, I get an error (Columns must be same length as key). In testing, I have found that the number of columns I list must match the number of words in the longest cell. No more, no less.
df[['column1','column2','column3','column4','column5','column6','column7','column8']] = df['main'].str.split(',', expand=True)
What I'd like is a way to not worry about how many words are in the cells of the main column. If the longest cell contains 6 words, split them out to 6 columns. If the longest cell contains 8 words, split them out to 8 columns. If the longest cell contains 10 words, drop the last two words and split the rest out using 8 columns.
About the main column in the original file: I will not know how many words exist in each of the cells. I just have 8 columns, so the first eight words (if there are that many) each get split to a column. The rest of the words (if any) get dropped.
Question: why do I need to have the exact number of columns in the code above if my longest cell doesn't contain more words than I have columns? I'm not understanding something.
Any help with the logic would be appreciated.
cols = df[['column1','column2','column3','column4','column5','column6','column7','column8']]
df2 = df['main'].str.split(',',expand=True, n=8)
#df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
#-------
if 0 in df2.columns:
    df['column1'] = np.where(df2[0].isnull(), df['column1'], df2[0])

You can use n=7, which splits at most seven times and so yields at most eight columns. The eighth column then holds any remaining words, so split it again and keep only the first:
df2 = df['main'].str.split(',', expand=True, n=7)
df = df.assign(**df2.set_axis(df.columns[:df2.shape[1]], axis=1))
df['column8'] = df['column8'].str.split(',').str[0]
You can use a list of labels instead of df.columns if you don't want to save the result in the first df2.shape[1] columns of df.
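The error in the question comes from the assignment itself: the list of labels on the left must have exactly as many entries as the split produces columns on the right. Here is a minimal sketch of one way around that, assuming a toy frame with a main column and a hypothetical labels list (neither comes from the original post). It splits fully, keeps at most the first eight pieces, and names only as many columns as actually exist.

```python
import pandas as pd

# Toy data (an assumption for illustration): rows with 3, 10 and 1 words.
df = pd.DataFrame({"main": ["a,b,c", "a,b,c,d,e,f,g,h,i,j", "x"]})

# Hypothetical target labels: column1 ... column8.
labels = [f"column{i}" for i in range(1, 9)]

# Split on every comma, then keep at most the first eight pieces.
parts = df["main"].str.split(",", expand=True).iloc[:, :len(labels)]

# Name only as many target columns as the split actually produced.
parts.columns = labels[:parts.shape[1]]

# Attach the new columns to the original frame.
df = df.join(parts)
print(df)
```

With this sample data the longest row has ten words, so all eight labels are used and the last two words are dropped. If the longest row had six words, only column1 to column6 would be created.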

Related

Counting elements in specified column of a .csv file

I am programming in Python.
I want to count how many times each word appears in a column. Column 4 of my .csv file contains about 7 different words, and I need to know how many times each one appears. E.g. there are 700 lines and I need to count how many times the phrase HelloWorld appears in column 4.
You can use pandas.Series.value_counts() on the column you want. Since you mentioned it's the fourth column, you can get it by index using iloc as well. Of course you have to install pandas as it's not from the standard library, e.g. using pip with pip install pandas if you haven't already.
An example:
import pandas as pd
df = pd.read_csv("path/to/file.csv")
fourth_column = df.iloc[:, 3] # Gets all rows for the fourth column (index starts at 0)
counts = fourth_column.value_counts()
print(counts) # You'll see the number of times each string appears in the column
# The keys are the strings and the values are the number of times they appear
hello_world_counts = counts["HelloWorld"]

split strings from a column in separate columns

I am trying to split the string values in a column into as many columns as there are strings in each row.
I am creating a new dataframe with three columns, and the string values are in the third column. I want to split them into new columns (which already have headers), but the number of strings, which are separated by semicolons, is different in each row.
If I use this code:
df['string']= df['string'].str.split(';', expand=True)
then only one value is left in the column, while the rest of the string values are not split out but discarded.
Can you advise on how this line of code should be modified in order to get the right output?
Many thanks in advance.
Instead of overwriting the original column, you can take the result of split and join it with the original DataFrame:
import pandas as pd
df = pd.DataFrame({'my_string':['car;war;bus','school;college']})
df = df.join(df['my_string'].str.split(';',expand=True))
print(df)
        my_string       0        1     2
0     car;war;bus     car      war   bus
1  school;college  school  college  None
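To keep only the first value in the original column, index the split result with .str[0] (without expand=True, since a DataFrame has no .str accessor):
df['string'] = df['string'].str.split(';').str[0]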

Concatenating the entire row in Pandas

I'm new to Pandas. I want to join an entire row of strings into a paragraph, so that each row ends up with a single paragraph instead of a number of columns. For example, if a DataFrame has 10 columns and 5 rows, I want the output to be 5 rows with a single column that combines the string data of all 10 columns. I need this so I can apply Bag of Words / TF-IDF to it for sentiment analysis. I need help doing this.
One solution I thought of, shown here for a single row, is
' '.join(str(x) for x in df.iloc[1, 0:11])
If there is a better solution, that would be more helpful.
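The single-row join above can be applied to every row at once. This is a minimal sketch, assuming a small made-up frame (the column names a, b, c are hypothetical) and a space as the separator.

```python
import pandas as pd

# Toy data (an assumption for illustration); one column is numeric on purpose.
df = pd.DataFrame({"a": ["good", "bad"], "b": ["movie", "plot"], "c": [1, 2]})

# Cast every cell to str, then join the cells of each row with a space.
paragraphs = df.astype(str).apply(lambda row: " ".join(row), axis=1)

# paragraphs is a Series with one string per original row.
print(paragraphs.tolist())
```

The result has one string per row, which can then be passed to a vectorizer for Bag of Words or TF-IDF.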

Pandas - Read only first few lines of each rows

I have a large CSV file with about 10000 rows of text information and each row of my dataset consists of a number of lines. However I just want to read say first 20 lines of each of the rows of my CSV file.
I came across the nrows parameter of the pandas read_csv method, which limits the number of rows of the dataset that are loaded. Is there also a way to read only the first 20 lines of data from each of the rows in pandas?
You can read in the csv with df = pd.read_csv('path/file.csv') and then just select the first 20 rows with df_new = df.head(20). Is that what you were thinking of?
If I get your question correctly, your CSV file has multiple rows, where each row has multiple lines separated by the newline character '\n'. And you want to choose the first (say, for example) 3 lines from each row.
This can be achieved as:
# Read in CSV file using pandas-
data = pd.read_csv("example.csv")
# The first two rows (toy example) of dataset are-
data.iloc[0,0]
# 'Hello, this is first line\nAnd this is the second line\nThird and final line'
data.iloc[1,0]
# 'Today is 3rd June\nThe year is 2020\nSummer weather'
# First row's first line-
data.iloc[0,0].split("\n")[0]
# 'Hello, this is first line'
# First row's first two lines-
data.iloc[0,0].split("\n")[0:2]
# ['Hello, this is first line', 'And this is the second line']
The general syntax to get the first 'n' lines from row 'x' (assuming that the first column has the string data) is:
data.iloc[x,0].split("\n")[:n]
To select the first 'm' lines (assuming there are m lines or more) from the first 'x' rows, use the code:
data.iloc[:x, 0].apply(lambda y: y.split("\n")[0:m])
Does this help?
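The per-row apply above can also be written with pandas' vectorized string methods. This is a minimal sketch, assuming a toy frame with a hypothetical text column whose cells contain newline-separated lines.

```python
import pandas as pd

# Toy data (an assumption for illustration): each cell holds several lines.
data = pd.DataFrame({"text": ["line 1\nline 2\nline 3", "only 1\nonly 2"]})

m = 2  # number of lines to keep from each row

# Split each cell into a list of lines, keep the first m, and re-join them.
first_lines = data["text"].str.split("\n").str[:m].str.join("\n")

print(first_lines.tolist())
```

Rows with fewer than m lines are returned unchanged, because slicing a short list simply returns the whole list.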
If TiTo's answer is not what you are looking for, maybe the iloc method is. You can store the first 20 rows by doing firstRows = df.iloc[:20].
However, if you only ever need the first 20 rows, you shouldn't load the whole file into memory. As you mentioned, this can be achieved with the nrows parameter.
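A minimal sketch of the nrows approach, using an in-memory CSV (an assumption made so the example runs without a file on disk):

```python
import io

import pandas as pd

# Build a 100-row CSV in memory (stand-in for a real file path).
csv_text = "text\n" + "\n".join(f"row {i}" for i in range(100))

# nrows stops parsing after the first 20 data rows.
df = pd.read_csv(io.StringIO(csv_text), nrows=20)

print(len(df))
```

With a real file, the same call would take the file path in place of the StringIO object.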

Combine the values of a specific column of a dataframe into one row or unit

I want to combine the different values/rows of a certain column. These values are texts, and I want to combine them in order to perform a word count and find the most common words.
The dataframe is called df and has 30 columns. I want to combine all the rows of the first column (labeled 'text') into one row, one list, or similar. The form doesn't matter as long as I can run FreqDist on it. I am not interested in grouping the values by some key; I just want all the values in this column to become one block.
I looked around a lot and couldn't find what I am looking for.
Thanks a lot.
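One possible way to do this is sketched below, assuming a toy frame with a text column. It uses collections.Counter from the standard library as a stand-in for NLTK's FreqDist, and plain whitespace splitting as a stand-in for a real tokenizer; both are substitutions made for illustration.

```python
from collections import Counter

import pandas as pd

# Toy data (an assumption for illustration).
df = pd.DataFrame({"text": ["the cat sat", "the dog ran", "a cat ran"]})

# Join every value of the column into one block of text.
all_text = " ".join(df["text"].astype(str))

# Naive tokenization: lowercase and split on whitespace.
words = all_text.lower().split()

# Count word frequencies; most_common(n) returns the n most frequent words.
counts = Counter(words)
print(counts.most_common(3))
```

The same words list could be passed to FreqDist instead of Counter if NLTK is installed.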
