pandas module to trim columns in python - python

Any idea why the code below can't keep the first column of my CSV file? I would like to keep several columns, first column included, in a new CSV file. But when I include the name of the first column in the selection,
I get an error:
"Type" not index.
import pandas as pd
f = pd.read_csv("1.csv")
keep_col = ['Type','Pol','Country','User Site Code','PG','Status']
new_f = f[keep_col]
new_f.to_csv("2.csv", index=False)
Thanks a lot.

Try f.columns.values.tolist() and check how the name of the first column comes out. It sounds like there is an encoding issue when you read the CSV. You can try specifying the "encoding" option in pd.read_csv() to see if that gets rid of the extra characters at the front. Otherwise, you can use f = f.rename(columns={'F48FBFBFType':'Type'}) to change whatever the current name of your first column is to simply 'Type'.
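A minimal sketch of the encoding idea, assuming the stray characters come from a UTF-8 byte order mark (BOM) at the start of the file. The sample bytes below are made up for illustration and are not the asker's 1.csv:

```python
import io
import pandas as pd

# Hypothetical CSV content that starts with a UTF-8 byte order mark (BOM).
raw = "Type,Pol,Country\nA,x,FR\n".encode("utf-8-sig")

# Decoding with an encoding that does not recognise the BOM leaves
# extra characters glued to the front of the first column name.
with_bom = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

# "utf-8-sig" strips a leading BOM if one is present.
clean = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")

print(with_bom.columns.tolist())
print(clean.columns.tolist())
```

If the first list shows a name that only ends in "Type" while the second shows plain "Type", the encoding was the cause.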

You are better off specifying the columns to read from your CSV file with usecols:
pd.read_csv('1.csv', usecols=keep_col).to_csv("2.csv", index=False)
Do you have any special characters in your first column?

Related

Remove Unnamed Columns in pandas

I am working on an excel file and the pandas shows the excel file like this.
How do I get rid of all the Unnamed columns?
This will do the trick
remove_cols = [col for col in gd.columns if 'Unnamed' in col]
gd.drop(remove_cols, axis='columns', inplace=True)
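The same idea can be tried on a small in-memory frame. This is only a sketch: the data below is invented to imitate the "Unnamed: N" labels pandas assigns to blank header cells, and it returns a new frame instead of modifying gd in place:

```python
import pandas as pd

# Hypothetical frame imitating what read_excel produces for blank header cells.
gd = pd.DataFrame(
    {
        "Unnamed: 0": [None, None],
        "Country": ["FR", "DE"],
        "Unnamed: 2": [None, None],
        "GDP": [1.0, 2.0],
    }
)

# Column labels are not always strings, so convert before checking the prefix.
unnamed = [col for col in gd.columns if str(col).startswith("Unnamed")]

# drop(columns=...) returns a new DataFrame and leaves gd untouched.
cleaned = gd.drop(columns=unnamed)
print(cleaned.columns.tolist())
```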
Looking at the result you are getting, the Excel data doesn't start on the first row. It also starts in column B instead of column A.
If you are able to edit the Excel file, I would recommend starting your data at A1 (by removing the empty column A and the empty rows at the top using Excel), as that will make later processing much easier for everyone reading the file.
If this file is not editable (perhaps it is generated by another party), you will need to skip the first couple of rows to read the correct headings:
gd = pd.read_excel(r"D:\gdp.xlsx", skiprows=3, usecols="B:L")

Using python pandas how can we select very specific rows and associated column

I am still learning python, kindly excuse if the question looks trivial to some.
I have a csv file with following format and I want to extract a small segment of it and write to another csv file:
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column and write it to a csv file in following format.
Since the format is not regular column headers followed by values, I am not sure how to select a starting point based on a cell value in a particular column. For example, actor_list2 may have any number of entries under it. Please help me understand whether this can be done using pandas DataFrame processing.
Update: The reason why I would like to automate it is because there can be thousands of such files and it would be impractical to manually get that info to create the final csv file which will essentially have a row for each file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result but given how erratic formatting is, this is no way to automate. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
Because your CSV file is not regular, it has many empty positions, which pandas reads as 'nan' objects. With header=None, the columns are labelled by their integer position.
I will use pandas to read the file:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize an empty dictionary to store the results in. It will be used to build an output DataFrame, whose content is finally written to a CSV file:
target={}
Now you need to find actor_list2 in the second column, which is the column with index 1. If it exists, store the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        if i >= len(df):  # reached the end of the file
            break
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the sequence of names is finished and a 'nan' object follows
            break
        target[name] = [score]
Finally, construct the DataFrame and write the new output.csv file:
df_output=pd.DataFrame(target)
df_output.to_csv('output.csv')
Now, you can go anywhere with the given example above.
Good Luck
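For reference, the marker-based search can also be wrapped in a small helper so it can be applied to many files. This is a sketch under assumptions: the sample layout below is invented (the real file is not shown in the question), and rows_after_marker is a hypothetical helper name, not a pandas function:

```python
import io
import pandas as pd

# Hypothetical irregular file; the real layout is not shown in the question.
raw = (
    "id,1234,\n"
    ",actor_list1,score\n"
    ",Ann,5\n"
    ",,\n"
    ",actor_list2,score\n"
    ",Bob,7\n"
    ",Cid,9\n"
    ",,\n"
    ",actor_list3,score\n"
    ",Dee,1\n"
)
df = pd.read_csv(io.StringIO(raw), header=None)


def rows_after_marker(frame, marker, col=1):
    """Return the rows below the first cell equal to `marker` in column `col`,
    stopping at the first empty cell in that column (or at the end of the file)."""
    hits = frame.index[frame[col] == marker]
    if len(hits) == 0:
        return frame.iloc[0:0]  # marker not found: empty frame
    start = frame.index.get_loc(hits[0]) + 1
    blank = frame[col].iloc[start:].isna().to_numpy()
    stop = start + (int(blank.argmax()) if blank.any() else len(blank))
    return frame.iloc[start:stop]


block = rows_after_marker(df, "actor_list2")
names = block[1].tolist()
print(names)
```

Because the block ends at the first empty cell, the number of entries under actor_list2 does not need to be known in advance.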

How do you separate the Column names and its values in Pandas?

I wanted to import this [dataset][1] named "wind.data" to perform some operations on it but I couldn't find a way to turn it into a proper table-like structure.
This is how it looks after importing:
wind dataframe.
I tried using sep=' ' parameter in pd.read_csv('wind.data', sep=' ') but it's not working.
How do I separate the column names and their respective values from this dataset?
[1]:
The file is not comma (or any other character) separated but is fixed width formatted.
Instead of trying to force read_csv to handle it correctly, you should use read_fwf.
df = pd.read_fwf("wind.data", header=1)
Try:
pd.read_csv('wind.data', delimiter=r'\s+')
Because there is not always a single space between columns.
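A short sketch of the whitespace-separator approach on made-up data. The sample below only imitates a space-aligned layout and is an assumption, not the actual contents of wind.data:

```python
import io
import pandas as pd

# Hypothetical sample with a varying number of spaces between columns.
raw = (
    "Yr Mo Dy   RPT   VAL\n"
    "61  1  1 15.04 14.96\n"
    "61  1  2 14.71 16.88\n"
)

# r"\s+" treats any run of whitespace as a single separator.
wind = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(wind.columns.tolist())
print(wind.shape)
```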

Read specific excel cell value without loading the whole sheet to dataframe?

I know how to read the whole dataset, I know how to read a part of it, but it always reads all of the columns from my excel file. I do it like this:
myfile = pd.ExcelFile('my_file.xlsx')
myfile.parse(2, skiprows=14, skipfooter= 2).dropna(axis=1, how='all')
But I cannot read only one specific cell this way, because it reads the whole row. Is there a way to limit the parser to one column?
UPDATE:
looking for a Pandas solution
Update your pandas to 0.24.2 or later.
Docs: see read_excel, specifically the usecols parameter.
I believe you will need to use a combination of skiprows and skipfooter to narrow down to specific row and usecols to get the column. This way you will get the specific cells value.

Pandas read_csv silently converting and messing up dates and strings?

I am reading a csv file that has two adjacent columns containing dates like this:
29/11/2004 00:00,29/11/2005 00:00,2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
When I read this using read_csv and then write it back to csv using the to_csv method, it gets converted to
29/11/2004 00:00,00:00.0,2.0,,,,,,,,
I have got two questions about this: Why does it read the first date okay but thinks the second, which seems to have exactly the same format, is 0? And why do the NULLs get converted to empty strings?
Here is the code I am using:
df = pandas.read_csv(filepath, sep = ",")
df.to_csv("C:\\tmp\\test.csv")
I'm not sure of the reason for the missing date. I think it's influenced by other rows.
For the NULL string problem, keep_default_na can help you to avoid that:
df = pd.read_csv('test.csv', sep=',', keep_default_na=False)
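A small sketch of the difference, using a one-row sample built from the line in the question. The column names start, end, count and note are invented here because the original header is not shown:

```python
import io
import pandas as pd

# One data row modelled on the question; the header names are hypothetical.
raw = "start,end,count,note\n29/11/2004 00:00,29/11/2005 00:00,2,NULL\n"

# By default, the string "NULL" is treated as a missing value (NaN).
default = pd.read_csv(io.StringIO(raw))

# With keep_default_na=False, "NULL" is kept as an ordinary string.
kept = pd.read_csv(io.StringIO(raw), keep_default_na=False)

out = kept.to_csv(index=False)
print(default.loc[0, "note"])
print(kept.loc[0, "note"])
```

Writing kept back out with to_csv then preserves the NULL text instead of an empty field.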
