Reading a particular column from a csv - python

I am currently trying to read in a CSV file to build a budget from a bank statement, and I want to group similar items (e.g. fuel). I'd like to get the values from column E (i.e. column 5), store these values in a list, pair them with their cost, and then group them into lumps, e.g. fuel. So far, simply to try to read the column, I have the following:
temp=pd.read_csv("statement.csv",usecols=['columnE'])
print(temp)
and the following table:
(Values removed for obvious reasons.) However, when I run this I get the error "Usecols do not match columns". Why is this? I assumed I would at least get a value, even if it's not the right one.

Correct the column name to
temp=pd.read_csv("statement.csv",usecols=['Transaction Description'])
and try again
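As a sketch of the grouping step the question describes, once the correct column names are used (the column names and values below are made up; match them to your statement's actual headers):

```python
import io
import pandas as pd

# Hypothetical statement excerpt; real bank exports name columns differently.
csv_text = """Transaction Description,Amount
SHELL FUEL,40.00
TESCO GROCERIES,25.50
SHELL FUEL,35.00
"""

statement = pd.read_csv(io.StringIO(csv_text),
                        usecols=['Transaction Description', 'Amount'])

# Group similar items and total their cost.
totals = statement.groupby('Transaction Description')['Amount'].sum()
print(totals)
```

With a real file you would pass the filename instead of the `StringIO` buffer.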

Related

Is there a way to parse each unique value of a column into individual CSVs?

EDIT: Creating files is working; removing columns is not.
EDIT2: ALL WORKING! Need help with combining two columns into one key. Is it possible to take two columns, state and county, and combine them into a state-county key?
I have a COVID-19 data set that I am trying to create tables with. Currently, I have one large dump file from the government github page.
Basically, I am attempting to take every unique value of the State column and create a new CSV with the respective columns, only for that state.
So if Arizona has 4 data entries, it would create a new CSV with those four entries.
The sample data set I am retrieving from can be found here.
As we can see, the columns contain identifiers, state names, dates, etc.
I am looking to take each individual state and create a new CSV with all the values for that state, including state, county, and the dates from 3/23 to 3/29.
This is a sample of what the data would look like after it is parsed:
What I believe needs to happen
What I have been working on is parsing out the unique values of the state column, which I did simply through:
data=pd.read_csv('deaths.csv')
print (data['Province_State'].unique())
Now I am trying to figure out how to select specific columns and write out the values for each unique state (including all counties for that same state).
Any help would be greatly appreciated!
EDIT:
Here's what I've tried:
def createCSV():
    data = pd.read_csv('deaths.csv', delimiter=',')
    data.drop([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    data = data.set_index('Province_State')
    data = data.rename(columns=pd.to_datetime)
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('3/23/2020', '3/29/20')] \
            .to_csv('{0}.csv'.format(name))
However, with this I get "unknown string format" for the columns that don't contain dates. I attempted to drop those columns by index, which didn't seem to do anything.
Manually deleting the columns gives the behaviour I am looking for, but I need to delete the columns with pandas to save time.
For saving by state:
for name, g in data.groupby('Province_State'):
    g.to_csv('{0}.csv'.format(name))
For saving by state while only using certain dates:
data = data.set_index('Province_State')
data = data.rename(columns=pd.to_datetime)
for name, g in data.groupby(level='Province_State'):
    g[pd.date_range('3/23/2020', '3/29/20')] \
        .to_csv('{0}.csv'.format(name))
This assumes that the only columns are the region name and the dates. If this isn't the case, remove the non-date columns prior to converting them to datetimes.
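Putting that together, here is a minimal end-to-end sketch with invented stand-in data (the real dump has identifier columns that would need dropping first, and a longer date range):

```python
import pandas as pd

# Stand-in for the government dump; the real file has many more columns.
data = pd.DataFrame({
    'Province_State': ['Arizona', 'Arizona', 'New York'],
    '3/23/20': [1, 2, 3],
    '3/24/20': [2, 4, 6],
})

data = data.set_index('Province_State')
data = data.rename(columns=pd.to_datetime)  # only works if every column is a date

# One CSV per state, restricted to the wanted date range.
for name, g in data.groupby(level='Province_State'):
    g[pd.date_range('3/23/2020', '3/24/2020')].to_csv('{0}.csv'.format(name))
```

This writes one file per state (e.g. Arizona.csv) containing only that state's rows.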

Iteration & Computation Pandas Dataframe

As a very very new beginner with Python & Pandas, I am looking for your support regarding an issue.
I need to iterate over columns, find the maximum value across the relevant columns for each row, and write it into a new variable for each row. The number of columns is not manageable by hand (almost 200), so I do not want to write each required column id manually. Most importantly, I need to start from a given column id and continue in increments of two up to a given last column id.
I would appreciate sample code; see the attachment too.
Try:
df['x']=df.max(axis=1)
Replace x with the name of your desired output column.
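The question also asks to start at a given column and step by two; one way, sketched here with made-up column names standing in for the ~200 real ones, is to slice with iloc using a step:

```python
import pandas as pd

# Toy frame standing in for the ~200-column data; names are invented.
df = pd.DataFrame({'id': [1, 2],
                   'a1': [5, 1], 'b1': [0, 0],
                   'a2': [3, 9], 'b2': [0, 0],
                   'a3': [7, 4], 'b3': [0, 0]})

# Positions of the first and last columns of interest.
start = df.columns.get_loc('a1')
stop = df.columns.get_loc('b3') + 1

# Take every second column in that range, then the row-wise maximum.
df['x'] = df.iloc[:, start:stop:2].max(axis=1)
print(df['x'])
```

Here the slice selects a1, a2, a3 and the result is the per-row maximum over just those columns.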

How to add new values to a dataframe's columns based on a specific row without overwriting existing data

I have a batch of identifiers and a pair of values that behave in the following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Sample output looks like
I would like to add these data into a dataframe, where I can use indexIDs[i] as the row label and append each incoming pair of values with the same identifier into the next consecutive columns.
I have attempted the following code, which didn't work:
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
It was a fine start for associating indexIDs[i] with a row, but I could not take in new data without overwriting the previous contents. I am aware the problem is the second line, which uses "=" and keeps overwriting the previous result over and over again. I am looking for an appropriate way to change my second line so it inserts new incoming data into the existing dataframe without overwriting.
Appreciate your time and effort, thanks.
I'm a bit confused about the nature of coordinate_x (is it a list, or what?). Anyway, maybe try to use append.
You could define an empty df with three columns:
df = pd.DataFrame([], columns=['a', 'b', 'c'])
then populate it with a loop over your lists:
for i in range(TOFILL):
    df = df.append({'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}, ignore_index=True)
and finally set a column as the index:
df = df.set_index('a')
Hope it helps.
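Worth noting: DataFrame.append was removed in pandas 2.0, so on current pandas an equivalent sketch collects the rows first and builds the frame once (variable names mirror the question; the values are invented):

```python
import pandas as pd

# Hypothetical inputs mirroring the question's variables.
indexIDs = [101, 102, 103]
coordinate_x = [1.0, 2.0, 3.0]
coordinate_y = [4.0, 5.0, 6.0]

# Collect rows in a list and build the DataFrame once, instead of
# appending row by row (DataFrame.append was removed in pandas 2.0).
rows = [{'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}
        for i in range(len(indexIDs))]
df = pd.DataFrame(rows).set_index('a')
print(df)
```

Building once from a list is also much faster than repeated appends, since each append copies the whole frame.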

Discard rows in dataframe if particular column values in list [duplicate]

I have a dataframe customers with some "bad" rows; the key in this dataframe is CustomerID. I know I should drop these rows. I have a list called badcu that looks like [23770, 24572, 28773, ...], where each value corresponds to a different "bad" customer.
Then I have another dataframe, let's call it sales, and I want to drop all the records for the bad customers, the ones in the badcu list.
If I do the following
sales[sales.CustomerID.isin(badcu)]
I get a dataframe with precisely the records I want to drop, but if I do
sales.drop(sales.CustomerID.isin(badcu))
it returns a dataframe with the first row dropped (which is a legitimate order) and the rest of the rows intact (it doesn't delete the bad ones). I think I know why this happens, but I still don't know how to drop the rows with the incorrect customer ids.
You need
new_df = sales[~sales.CustomerID.isin(badcu)]
You can also use query
sales.query('CustomerID not in @badcu')
I think the best way is to drop by index; try it and let me know:
sales.drop(sales[sales.CustomerID.isin(badcu)].index.tolist())
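A quick self-contained demonstration of both approaches on toy data (all IDs and amounts invented), showing they produce the same result:

```python
import pandas as pd

# Toy sales table; CustomerID values are made up.
sales = pd.DataFrame({'CustomerID': [23770, 11111, 24572, 22222],
                      'Amount': [10, 20, 30, 40]})
badcu = [23770, 24572, 28773]

# Boolean-mask approach: keep rows whose CustomerID is NOT in badcu.
kept = sales[~sales.CustomerID.isin(badcu)]

# Index-based approach: drop the rows whose index labels match bad customers.
kept2 = sales.drop(sales[sales.CustomerID.isin(badcu)].index)

print(kept)
```

The earlier mistake was passing a boolean Series to drop, which pandas interprets as row labels (True/False), not as a mask.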

Graphlab SFrames: Error in using SFrames with the dataset

In Graphlab,
I am working with a small set of fitness data, to use recommender functions that can provide recommendations. The dataset has a user id column but no item ids; instead, the different items are arranged as columns, with their respective ratings in the rows corresponding to each user id. In order to use any Graphlab recommender method, I need to have user ids and item ids. Here is what I did:
v = graphlab.SFrame.read_csv('Data.csv')
userId = v["user_id"]
itemId = v["x","y","z","x1","y1","z1"]  # x, y, z, x1, y1, z1 are activities that are actually the columns in Data and contain the corresponding ratings given by each user
sf= graphlab.SFrame({'UserId':userId,'ItemId':itemId})
print sf.head(5)
Basically, I extracted the user_id column from Data and tried making a column for ItemId using the x, y, z, etc. columns extracted from the same data, in order to make another SFrame with just these 2 columns. This code results in a tabular SFrame with 2 columns as expected, but they are not arranged in the same order I passed the arguments to SFrame. So the output gives ItemId as the first column and then UserId. Even when I change the order in which I pass these 2 to SFrame, it still gives the same output. Does anyone know the reason why?
This creates a further problem when using any recommender method, as it gives the error: Column name user_id does not exist.
The reason for the column ordering is because you are passing a Python dictionary to the SFrame constructor. Dictionaries in Python will not keep keys in the order they are specified; they have their own order. If you prefer "UserId" to be first, you can call sf.swap_columns('UserId','ItemId').
The order of the columns does not affect the recommender method though. The Column name 'user_id' does not exist error will appear if you don't have a column named exactly user_id AND don't specify what the name of the user_id column is. In your case, you would want to do: graphlab.recommender.create(sf, user_id='UserId', item_id='ItemId').
Also, you may want to look at the stack method, which could help get your data in to the form the recommender method expects. Your current SFrame sf I think will have a column of dictionaries where the item id is the key and the rating is the value. I believe this would work in this case:
sf.stack('ItemId', new_column_name=['ItemId','Rating'])
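Graphlab isn't always available to experiment with, but the same wide-to-long reshape that stack performs can be sketched in pandas with melt (column names x, y, z here stand in for the activity columns; the ratings are invented):

```python
import pandas as pd

# Pandas analogue of the wide-to-long reshape that SFrame.stack performs.
v = pd.DataFrame({'user_id': [1, 2],
                  'x': [5, 3], 'y': [4, 2], 'z': [1, 5]})

# Each (user, activity, rating) triple becomes its own row, which is the
# (user_id, item_id, rating) shape a recommender expects.
long_df = v.melt(id_vars='user_id', var_name='ItemId', value_name='Rating')
print(long_df)
```

The result has one row per user-item pair, matching the form the recommender method expects.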
