I've placed the contents of a .csv file in the data list
with the following code in Jupyter notebook
import csv

data = []
with open("president_county_candidate.csv", "r") as f:
    contents = csv.reader(f)
    for c in contents:
        data.append(c)
I can only get an element through the index number, but that gives me the whole row of the list. How can I choose specific elements and the count? In the image, you can see the content of the List(data).
data
You can use the pandas library to read the CSV into a DataFrame and then perform various operations on it. See the code below:
import pandas as pd
df = pd.read_csv('president_county_candidate.csv')
print(df.shape)
This prints the number of rows and columns in the CSV file.
To extract specific columns, you can use:
newdf = df[['candidates', 'votes']]
This gives you a new DataFrame containing only the columns listed above.
Here is also a solution for the approach used in your question.
You can find the index of each column you need and then, while parsing the contents, pick out only those indexes.
For example:
import csv

data = []
with open("president_county_candidate.csv", "r") as f:
    contents = csv.reader(f)
    for c in contents:
        # keep only the candidate (index 2) and votes (index 4) fields
        data.append([c[2], c[4]])
This keeps only the Candidate and Votes data.
Note: It is better to look up the index of a specific column from the header row with list.index('columnName') and use that variable instead of a hard-coded index.
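The note above can be sketched as follows. This is only an illustration: the inline sample and its column names are assumptions standing in for the real file, so adjust the names to whatever the header row actually contains.

```python
import csv
import io

# Hypothetical sample standing in for president_county_candidate.csv;
# the column names used here are assumptions.
sample = io.StringIO(
    "state,county,candidate,party,total_votes\n"
    "Delaware,Kent County,Joe Biden,DEM,44552\n"
    "Delaware,Kent County,Donald Trump,REP,41009\n"
)

reader = csv.reader(sample)
header = next(reader)                     # first row holds the column names
cand_idx = header.index("candidate")      # look up positions by name
votes_idx = header.index("total_votes")

# Keep only the two fields of interest from every remaining row.
data = [[row[cand_idx], row[votes_idx]] for row in reader]
print(data)
print(len(data))                          # number of data rows
```

Because the header row is consumed with next(reader) before the loop, len(data) counts only the data rows.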
I have 2 files that contain partly the same items. To detect them there exists a unique identifier (UID).
What I am trying to achieve is to compare the UIDs in the first file with the UIDs in the second file. Where they are identical, a column in the first file should be filled with the content of the corresponding column in the second file.
import pandas as pd

dfFile2 = pd.read_csv("File2.csv", sep=";")
dfFile1 = pd.read_csv("File1.csv", sep=";")
UIDURLS = dfFile2["UID"]
UIDKonf = dfFile1["UID"]
URLSurl = dfFile2["URL"]
URLSKonf = dfFile1["URL"]
for i in range(0, len(UIDKonf)):
    for j in range(0, len(UIDURLS)):
        if UIDKonf.at[i] == UIDURLS.at[j]:
            URLSKonf.at[i] = URLSurl[j]
The code above does not give me any errors, but I also want it to directly write into the original .csv and not into the Dataframe. How could I achieve that?
Best
Once the DataFrame holds the updated information you want, you can write it back to a CSV file with DataFrame.to_csv; pass the original file's path to overwrite that file.
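A minimal sketch of that round trip, assuming both files have the UID and URL columns from the question. The inline data and example URLs are made up, and Series.map is swapped in for the nested loops; it is one possible approach, not the only one.

```python
import io
import pandas as pd

# Hypothetical stand-ins for File1.csv and File2.csv (semicolon-separated).
file1 = io.StringIO("UID;URL\n1;\n2;\n3;old\n")
file2 = io.StringIO(
    "UID;URL\n2;https://example.com/b\n3;https://example.com/c\n4;https://example.com/d\n"
)

df1 = pd.read_csv(file1, sep=";")
df2 = pd.read_csv(file2, sep=";")

# Build a UID -> URL lookup from the second file, then fill the first file's
# URL column wherever the UID matches; unmatched rows keep their old value.
lookup = df2.drop_duplicates("UID").set_index("UID")["URL"]
df1["URL"] = df1["UID"].map(lookup).fillna(df1["URL"])

# Write the result out. An in-memory buffer is used here; passing
# "File1.csv" instead would overwrite the original file.
out = io.StringIO()
df1.to_csv(out, sep=";", index=False)
result = out.getvalue()
print(result)
```

index=False keeps pandas from adding its row index as an extra column, so the written file has the same layout as the one that was read.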
I have been looking for an approach that lets me load only those columns from a csv file that satisfy a certain condition while I create a DataFrame, i.e. something that can skip the unwanted columns, because I have a large number of columns and only some are actually useful for testing purposes. I also want to load only those columns that have mean > 0.0. The idea is similar to skipping a certain number of rows or reading the first nrows, but I am looking for condition-based filtering on column names and values.
Is this actually possible for Pandas? To do things on-fly accumulating results first without loading everything into memory?
There's no direct/easy way of doing that (that I know of)!
The first idea that comes to mind is to read the first line of the csv (i.e. the headers) and then build a list of your desired columns with a list comprehension:
columnsOfInterest = [ c for c in df.columns.tolist() if 'node' in c]
and get their positions in the csv. With that, you have the column names/positions, so you can read only those from your csv.
However, the second part of your condition requires computing the mean, so unfortunately you will have to read all the data for these columns, compute the means, and then keep the columns of interest (where the mean is > 0). That is as far as my knowledge goes; maybe someone else has a way of doing this and can help you out. Good luck!
I think usecols is what you are looking for.
df = pandas.read_csv('<file_path>', usecols=['col1', 'col2'])
You could preprocess the column headers using the csv library first.
import csv

# In Python 3 the csv module expects a text-mode file opened with newline=''
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    column_names = next(reader, None)
filtered_columns = [name for name in column_names if 'some_string' in name]
Then proceed using usecols from pandas as abhi mentioned.
df = pandas.read_csv('data.csv', usecols=filtered_columns)
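Putting the header-first idea, usecols, and the mean condition together might look like the sketch below. The inline CSV, the 'node' substring, and the column names are assumptions for illustration. Reading with nrows=0 is used here as an alternative to the csv module for fetching only the header.

```python
import io
import pandas as pd

# Hypothetical CSV; the 'node' substring and column names are assumptions.
raw = "id,node_a,node_b,other\n1,0.5,-1.0,9\n2,1.5,-3.0,9\n"

# Pass 1: nrows=0 parses only the header line, so no data rows are loaded.
header = pd.read_csv(io.StringIO(raw), nrows=0).columns
wanted = [c for c in header if "node" in c]

# Pass 2: load just the name-filtered columns.
df = pd.read_csv(io.StringIO(raw), usecols=wanted)

# The mean condition needs the values, so it is applied after loading.
df = df.loc[:, df.mean() > 0.0]
print(df.columns.tolist())
```

The name filter reduces what is read from disk; the value filter can only reduce what is kept afterwards.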
I'm working with large dataframes (15gb) and every time I try to open them it raises a memory error.
I successfully opened dataframe A, whose first column is an ID that also appears in dataframe B.
B has many more rows and IDs that I don't care about and, since I can't filter rows after opening it because of the memory error, I tried to filter the rows I need while opening it.
Following the post "skip specific line that contains certain value when you read pandas data frame", I tried to use:
import StringIO
import pandas as pd

emptylist = []
def read_file(file_name):
    with open(file_name, 'r') as fh:
        for line in fh.readlines():
            parts = line.split(',')
            if parts[0] not in emptylist:
                emptylist.append(parts[0])
            if parts[0] in set(idlist):
                yield line

stream = StringIO.StringIO()
stream.writelines(read_file('B.csv'))
stream.seek(0)
df = pd.read_csv(stream)
where emptylist should contain the unique values of dataframe B's ID, and idlist is the column ID of Dataframe A converted to list.
The problem is that it still gives me a memory error at stream.writelines(read_file('B.csv')), and I don't understand why, since the number of rows should be exactly the same as in Dataframe A, and B has only 2 columns, against the 3 of dataset A, which I can open.
Thank you very much for your help!
It still raises a memory error because fh.readlines() reads the whole of B.csv into RAM before processing it. You can iterate over the file object instead:
with open("B.csv") as infile:
    for line in infile:
        do_something_with(line)
It only reads one line at a time. When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else.
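The same line-at-a-time idea is available inside pandas through the chunksize parameter of read_csv. The sketch below is an illustration under assumptions: the inline data stands in for B.csv, the column is assumed to be named ID, and wanted_ids stands in for the IDs taken from dataframe A.

```python
import io
import pandas as pd

# Hypothetical data standing in for B.csv and for the ID column of A.
b_csv = io.StringIO("ID,value\n1,a\n2,b\n3,c\n4,d\n5,e\n")
wanted_ids = {2, 5}

# chunksize makes read_csv return an iterator of small DataFrames, so only
# one chunk plus the rows kept so far are in memory at any time.
kept = [
    chunk[chunk["ID"].isin(wanted_ids)]
    for chunk in pd.read_csv(b_csv, chunksize=2)
]
df_b = pd.concat(kept, ignore_index=True)
print(df_b)
```

For a real 15 GB file a much larger chunksize (for example 100000 rows) would be a more sensible starting point; the value 2 is only there to make the tiny sample span several chunks.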
Begin Code
import pandas as pd

df = pd.read_csv('C:/Users/lhicks/Documents/Corporate/test.csv')
saved_column = df.FirstName
saved_column2 = df.LastName
saved_column3 = df.Email
print(saved_column)
print(saved_column2)
print(saved_column3)
Itemlist = []
Itemlist.append(saved_column)
print(Itemlist)
End of Code
The objective is to select specific columns from a specified xls sheet, grab all the rows from the specified columns, and then print that data out.
The current issue is the data is grabbed correctly, but after 29-30 rows, it prints/stores a "...", and then jumps to line item 880s, and finishes out from there.
The additional issue is that it also stores this as the value, making it worthless due to not providing the full dataset.
The eventual process is to add the selected columns to a new xls sheet to clean up the old data, and then add the rows to a templated document to generate an advertisement letter.
The first question is: how do I get all the fields to populate? The second is: what is the best approach for this? Please provide additional links as well if possible; this is a practical learning experience for me.
Pandas tries to shorten your data when printing it.
NOTE: all the data is still there (print df.shape to check the size of your DataFrame). The shortened output is just a convenience so that your screen is not flooded with tons of rows/columns.
Try this:
import pandas as pd

fn = 'C:/Users/lhicks/Documents/Corporate/test.csv'
cols = ['FirstName','LastName','Email']
df = pd.read_csv(fn, usecols=cols)
df.to_excel('/path/to/excel.xlsx', index=False)
This will parse only ['FirstName','LastName','Email'] columns from a CSV file and will export them to Excel file
UPDATE:
If you want to control how many rows pandas prints:
with pd.option_context("display.max_rows",200):
    print(df)
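To confirm that the "..." is only a display artifact and to get plain Python values for later use (for example, filling a letter template), a column can be converted with tolist(). This is a small sketch with made-up data; the column names follow the question.

```python
import io
import pandas as pd

# Hypothetical data with enough rows that pandas shortens the printed output.
rows = "\n".join(f"First{i},Last{i},user{i}@example.com" for i in range(100))
csv_text = "FirstName,LastName,Email\n" + rows + "\n"

df = pd.read_csv(io.StringIO(csv_text), usecols=["FirstName", "Email"])

# The "..." appears only in the printed summary; the Series holds every row.
first_names = df["FirstName"].tolist()
print(len(first_names))

# to_string() renders every row regardless of the display.max_rows setting.
full_text = df["FirstName"].to_string()
```

Appending the Series itself to a list stores the pandas object, and printing that list shows its shortened representation; tolist() stores the individual values instead.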
I crawled a table from an HTML page and parsed it into a csv file. However, the format of the table on the web changed partway through, and the earlier rows were not updated, so some of the columns are deprecated. It looks something like this:
The two columns in the red box are deprecated and should be dropped, and the two columns to their right should replace them. How would I do this in Pandas?
After crawling, the csv file looks like this:
In a nutshell, I want to drop some columns from a certain row, and replace them.
I ran into a similar problem and solved it outside of pandas, then merged the dataframes corresponding to the two kinds of rows:
A = []
B = []
with open(your_file) as f:
    for line in f:
        # strip the trailing newline so it does not end up in the last field
        parts = line.rstrip('\n').split(your_separator)
        if len(parts) == expected_number_of_columns:
            A.append(parts)
        else:
            B.append(parts)
The two lists of lists A and B now hold the lines corresponding to the two kinds of format in your csv file.
A = pd.DataFrame(A, columns=list_of_columns)
B = pd.DataFrame(B, columns=list_of_columns_2).drop(columns=columns_to_drop)
df = pd.concat([A, B]).reset_index(drop=True)
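An end-to-end version of this approach on a tiny made-up file is sketched below. The layout is an assumption (three columns in the old format, five in the new one with two deprecated columns), as are all column names; the real file may differ.

```python
import io
import pandas as pd

# Hypothetical file: the first rows use a 3-column layout, later rows carry
# two deprecated columns plus their two replacements (5 columns).
raw = io.StringIO(
    "2019-01-01,10,20\n"
    "2019-01-02,11,21\n"
    "2019-01-03,0,0,12,22\n"
)
old_cols = ["date", "x", "y"]
new_cols = ["date", "x_old", "y_old", "x", "y"]

# Split the lines by how many fields they contain.
old_rows, new_rows = [], []
for line in raw:
    parts = line.rstrip("\n").split(",")
    (old_rows if len(parts) == len(old_cols) else new_rows).append(parts)

a = pd.DataFrame(old_rows, columns=old_cols)
b = pd.DataFrame(new_rows, columns=new_cols).drop(columns=["x_old", "y_old"])
df = pd.concat([a, b]).reset_index(drop=True)
print(df)
```

All values are still strings at this point because the lines were split by hand, so a conversion step such as pd.to_numeric on the numeric columns would typically follow.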