I am programming in Python.
I want to count how many times each word appears in a column. Column 4 of my .csv file contains roughly 7 different words and I need to know how many times each one appears. E.g. there are 700 lines and I need to count how many times the phrase HelloWorld appears in column 4.
You can use pandas.Series.value_counts() on the column you want. Since you mentioned it's the fourth column, you can also get it by position using iloc. Note that pandas is not part of the standard library, so install it first if you haven't already, e.g. with pip install pandas.
An example:
import pandas as pd
df = pd.read_csv("path/to/file.csv")
fourth_column = df.iloc[:, 3] # Gets all rows of the fourth column (indexing starts at 0)
counts = fourth_column.value_counts()
print(counts) # You'll see the number of times each string appears in the column
# The keys are the strings and the values are the number of times they appear
hello_world_counts = counts["HelloWorld"] # Count for one specific word
I have an excel file with over 3700 entries in a single column. The entries contain a full name and number with only whitespace to separate the two. To extract the number I use the following:
import re
import pandas as pd
dataframe = pd.read_excel('example.xlsx')
index_number = re.findall(r'\d+', str(dataframe['entry']))
The output gives only 10 numbers - the first and last 5, but if the same code is run on an excel sheet with only 50 entries, the output includes all 50 numbers.
Any ideas what is going wrong?
The problem with your code is related to the way pandas converts a Series to a string. If the Series is of moderate length, the full data is printed. When it's longer, only the first and last 5 items are printed, with an ellipsis in place of the others.
To extract the numbers, apply an extracting function to each cell independently (this is much more efficient than coercing the whole column to a string). Also note that the regex can be compiled in advance to speed things up even more. With the help of astype you'll get integers instead of their string representation.
import re
import pandas as pd
dataframe = pd.read_excel('example.xlsx')
regex = re.compile(r'(\d+)')  # compile once, reuse for every cell
dataframe['number'] = (dataframe['entry']
                       .apply(lambda el: regex.search(el).group(1))  # first run of digits in each cell
                       .astype(int))  # store as integers rather than strings
I have this big csv file with data from an experiment. The first part of each person's responses is a trial part that doesn't include the time they took for each response, and I don't need that. After that part, the data adds another column, which is the time, and those are the rows I need. So, basically, the csv has a lot of unusable data with 9 columns instead of 10, and I need only the rows with the 10 columns. How can I grab just that data instead of all of it?
As an example, the first row below shows the data without the time column (second to last) and the second row shows the data I need, with the time column added. I only need all the rows of the second kind, of which there are thousands. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas, then filter with df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask: mask = ~df.time.isna() flags each row as True/False depending on the condition.
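A minimal sketch of that idea, assuming the file can be read into a DataFrame and that the column is actually named time (both the file name and the column name here are placeholders):
import pandas as pd
df = pd.read_csv("your_file.csv")   # placeholder file name
mask = ~df["time"].isna()           # True where the time column actually has a value
rows_with_time = df[mask]           # keep only the rows that include a time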
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull() # Find rows where the last column is missing
df = df[~invalid_rows] # Select only valid rows
If your columns are named, you can use df['column_name'] instead of df.iloc[:,-1].
Of course this means you first load the full dataset, but in many cases that is not a problem.
I am a newbie in Python.
Say I have the following very simple table saved in a text file:
one two three
four five six
seven eight nine
I can view this table using pandas:
import pandas as pd
df = pd.read_csv('table.txt')
print(df)
The outcome of this is:
one two three
0 four five six
1 seven eight nine
Now, I want to delete the first row (i.e. one two three). My first thought was to write:
df.drop([0])
Because rows and columns are numbered starting from 0. But what this does is drop the second row (i.e. four five six).
Thus the title of this question. How to remove a row that has no index number? Because as seen from print(df), the first row is not given an index.
I did try to Google this, but I was not able to find an answer. This is probably because I do not know some keywords.
one two three is being treated as the header of the CSV file.
So to read it as ordinary data instead of a header, pass header=None:
df = pd.read_csv('table.txt', header=None)
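From there, a small sketch of two common options (using the same table.txt as above): either drop the now-indexed first row, or skip that line while reading.
import pandas as pd
# Option 1: read every line as data; the first line gets index 0 and can be dropped
df = pd.read_csv('table.txt', header=None)
df = df.drop(0)
# Option 2: skip the first line entirely while reading
df = pd.read_csv('table.txt', skiprows=1, header=None)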
I'm having trouble with the thought process for splitting a single column into multiple columns using pandas. I have a main column that could have up to ten words separated by commas. I only have eight columns to split these words out to (no more).
I'm currently using the code below to split the words into multiple columns. This code works as long as I know exactly how many words are in the longest cell. Example: in the case below, one of the cells in the original file must have exactly eight words for this to work properly. Otherwise, I get an error (Columns must be same length as key). In testing, I have found that the number of target columns must exactly match the number of words in the longest cell. No more, no less.
df[['column1','column2','column3','column4',
    'column5','column6','column7','column8']] = df['main'].str.split(',', expand=True)
What I'd like to see happen is a way to not worry about how many words are in the cells of the main column. If the longest cell contains 6 words, then split them out to 6 columns. If the longest cell contains 8 words, then split them out to 8 columns. If the longest cell contains 10 words, then drop the last two words and split the rest out into 8 columns.
About the original file's main column: I will not know how many words exist in each of the cells. I just have 8 columns, so the first eight words (if there are that many) get the honor of splitting to a column. The rest of the words (if any) will get dropped.
Question: why do I need to have the exact number of columns in the code above if my longest cell doesn't have more words than I have columns? I'm not understanding something.
Any help with the logic would be appreciated.
cols = df[['column1','column2','column3','column4','column5','column6','column7','column8']]
df2 = df['main'].str.split(',',expand=True, n=8)
#df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
#-------
if 0 in df2.columns:
    df['column1'] = np.where(df2[0].isnull(), df['column1'], df2[0])
You can use n=8 and then split the last column. (The error you saw happens because str.split(expand=True) produces exactly as many columns as the longest split, so assigning the result to a fixed list of eight names fails whenever those counts don't match; passing n caps how many pieces are produced.)
df2 = df['main'].str.split(',', expand=True, n=8)
df = df.assign(**df2.set_axis(df.columns[:df2.shape[1]], axis=1))
df['column8'] = df['column8'].str.split(',').str[0]
You can use a list of labels instead of df.columns if you don't want to save the result in the first df2.shape[1] columns of df.
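For example, a rough sketch of that variant, assuming the eight target names column1 through column8 (note it caps the split at n=7 so anything beyond the eighth word stays in the last piece and is then dropped):
labels = ['column1', 'column2', 'column3', 'column4',
          'column5', 'column6', 'column7', 'column8']
df2 = df['main'].str.split(',', expand=True, n=7)  # at most 8 pieces; overflow stays in the last one
df = df.assign(**df2.set_axis(labels[:df2.shape[1]], axis=1))
if 'column8' in df:
    df['column8'] = df['column8'].str.split(',').str[0]  # keep only the eighth word, drop the rest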
I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where any differences lie. I would prefer to parse this data with Python rather than use any excel-related tools.
Are you using pandas?
import pandas as pd
df = pd.concat([pd.read_csv('file1.csv'), pd.read_csv('file2.csv')], ignore_index=True)
# boolean Series indicating which rows are duplicated
df.duplicated()
# dataframe with duplicate rows removed (first occurrences kept)
df[~df.duplicated()]
# dataframe with only the duplicate rows
df[df.duplicated()]
# number of duplicate rows present
df.duplicated().sum()
An efficient way would be to read each line from the first file (the one with fewer lines) and save it in an object like a set or dictionary, where lookups have O(1) complexity.
Then read the lines from the second file and check whether each one exists in the set.
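A rough sketch of that approach (the file names are placeholders, and whole raw lines are used as the set entries):
# load the smaller file's lines into a set for O(1) membership checks
with open('file1.csv') as f1:
    seen = {line.rstrip('\n') for line in f1}
# report lines of the second file that never appear in the first
with open('file2.csv') as f2:
    differences = [line.rstrip('\n') for line in f2 if line.rstrip('\n') not in seen]
print(f"{len(differences)} lines in file2.csv are not present in file1.csv")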