I want to filter a rather large Pandas dataframe (about 3 million rows) by date.
For some reason the drop method does not work at all when used with a boolean criterion. It just returns the same old dataframe. Dropping single rows is no problem, though.
This is the code I used initially, which essentially does nothing at all:
import pandas as pd
from datetime import datetime
#open the file
df = pd.read_csv('examplepath/examplefile.csv', names=['File Name','FileSize','File Type','Date Created','Date Last Accessed','Date Last Modified','Path'],\
delimiter=';', header=None, encoding="ISO-8859-1",)
#convert to german style date
df['Date Created'] = pd.to_datetime(df['Date Created'], dayfirst=True)
#drop rows and assign new dataframe
df_filtered = df.drop(df[df['Date Created'] > datetime(2010,1,1)])
I then came up with this code, which seemingly works like a charm:
import pandas as pd
from datetime import datetime
#open the file
df = pd.read_csv('examplepath/examplefile.csv', names=['File Name','FileSize','File Type','Date Created','Date Last Accessed','Date Last Modified','Path'],\
delimiter=';', header=None, encoding="ISO-8859-1",)
#convert to german style date
df['Date Created'] = pd.to_datetime(df['Date Created'], dayfirst=True)
#select rows and assign new dataframe
df_filtered = df['Date Created'] < datetime(2010,1,1)
In theory, both snippets should do the same thing, right?
Is one of them to be preferred? Can I just work with my second snippet? In the future I may have to add a second filter date.
I hope someone can help me.
Thanks and best regards,
Stefan
You have to pass drop a list of index labels (to drop rows) or column names (to drop columns), not a filtered dataframe.
Read the docs and the examples given there.
Your second approach works because boolean indexing is the way you filter a dataframe.
You may use it at will.
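To make the difference concrete, here is a minimal sketch on a small made-up dataframe. The file names, the dates, and the variable names cutoff, lower, mask, df_dropped and df_between are assumptions for illustration, not part of the question. It shows that the bare comparison only yields a boolean Series that still has to be passed back into df[...], that drop needs the index labels of the rows to remove, and how a second filter date could be combined with &.

```python
from datetime import datetime

import pandas as pd

# Made-up sample standing in for the large CSV from the question
df = pd.DataFrame({
    "File Name": ["a.txt", "b.txt", "c.txt"],
    "Date Created": pd.to_datetime(
        ["31.12.2009", "01.02.2010", "15.06.2008"], format="%d.%m.%Y"
    ),
})
cutoff = datetime(2010, 1, 1)

# The comparison alone is only a boolean Series (True/False per row)
mask = df["Date Created"] < cutoff

# Boolean indexing: keep the rows where the mask is True
df_filtered = df[mask]

# drop() works too, but it needs the index labels of the rows to remove
df_dropped = df.drop(df[df["Date Created"] >= cutoff].index)

# A second filter date can be combined with & (note the parentheses)
lower = datetime(2009, 1, 1)
df_between = df[(df["Date Created"] >= lower) & (df["Date Created"] < cutoff)]

print(df_filtered)
print(df_between)
```

Both df_filtered and df_dropped end up with the same rows; boolean indexing is usually the shorter way to write it.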
Using Python pandas, how can we change the data frame?
First, how do I copy the column name down to the other cell (blue)?
Second, how do I delete the row and the index column (orange)?
Third, how do I modify the date format (green)?
I would appreciate any feedback.
Update
df.iloc[1,1] = df.columns[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.set_index('Date')
print(df.columns)
Question 1 - How to copy column name to a column (Edit- Rename column)
To rename columns, either assign to df.columns or use pandas.DataFrame.rename:
# Assign all names at once; the list needs 2 entries because you have 2 columns
df.columns = ['Date','Asia Pacific Equity Fund']
# Or rename selected columns with pandas.DataFrame.rename
df.rename(columns = {'Asia Pacific Equity Fund':'Date',"Unnamed: 1":"Asia Pacific Equity Fund"}, inplace = True)
df.columns returns all the column names of the dataframe; you can access each name by position.
Please refer to "Rename unnamed column pandas dataframe" for how to change unnamed columns.
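As a small illustration of the rename call, here is a sketch on a made-up two-column frame. The sample values are assumptions; only the column names come from the answer. Without inplace=True, rename returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Made-up frame whose first header is really a value and whose second is unnamed
df = pd.DataFrame({
    "Asia Pacific Equity Fund": ["2019-01-01", "2019-01-02"],
    "Unnamed: 1": [1.0, 1.1],
})

# Each old name is mapped to its new name independently
renamed = df.rename(columns={
    "Asia Pacific Equity Fund": "Date",
    "Unnamed: 1": "Asia Pacific Equity Fund",
})

print(list(df.columns))
print(list(renamed.columns))
```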
Question 2 - Delete a row
# Keep everything from the second row onward and renumber the index
df = df.iloc[1:].reset_index(drop=True)
# To remove specific rows by index label
df = df.drop([0,1]).reset_index(drop=True)
Question 3 - Modify the date format
current_format = '%Y-%m-%d %H:%M:%S'
desired_format = "%Y-%m-%d"
df['Date'] = pd.to_datetime(df['Date']).dt.strftime(desired_format)
# Pass the existing format explicitly via the format parameter
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)
# To update the date format of the index
df.index = pd.to_datetime(df.index, format=current_format).strftime(desired_format)
Please refer to the pandas.to_datetime documentation for more details.
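For reference, a minimal sketch of the date conversion on made-up values, passing the known layout through the format parameter of pd.to_datetime. The sample dates and the Value column are assumptions. Note that strftime returns strings, so the column is no longer a datetime column afterwards.

```python
import pandas as pd

current_format = "%Y-%m-%d %H:%M:%S"
desired_format = "%Y-%m-%d"

# Made-up sample data
df = pd.DataFrame({
    "Date": ["2019-04-10 19:00:00", "2019-04-11 20:30:00"],
    "Value": [7, 20],
})

# Parse with the known format, then render as date-only strings
df["Date"] = pd.to_datetime(df["Date"], format=current_format).dt.strftime(desired_format)

print(df)
```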
I'm not sure I understand your questions. Do you actually want to change the dataframe, or only how it is printed/displayed?
Indexes can be changed with the methods .set_index() or .reset_index(), or dropped altogether. If you just want to remove the first digit from each index (that's what I understood from the orange column), you should create a list with the new indexes and pass it as a column to your dataframe.
Regarding the date format, it depends on what you want the new format to be. Take a look at Python's datetime.
I strongly suggest you take a closer look at pandas' features and documentation, and at how to handle a dataframe with this library. There are plenty of great sources a Google search away :)
Delete the first two rows using this.
Rename the second column using this.
Work with datetime format using the datetime package. Read about it here
I'm trying to create a DataFrame out of two existing ones. I read the titles of some articles on the web; the first column is the title and the ones after it are timestamps.
I want to concatenate both data frames but leave out the rows with the same title (column one).
I tried
df = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
but because the other columns may not be exactly the same every time, I need to leave out every row that has the same first column. How would I do this?
By the way, sorry for not knowing all the right terms for my problem.
You should first remove the duplicate rows from df2 and then concat it with df1:
df = pd.concat([df1, df2[~df2.title.isin(df1.title)]]).reset_index(drop=True)
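A small sketch of this approach on made-up frames (the titles and the ts column are assumptions). It also shows drop_duplicates with subset, which compares only the title column and might be an alternative if duplicates can occur inside a single frame as well.

```python
import pandas as pd

# Made-up data: same title "B" in both frames, but different timestamps
df1 = pd.DataFrame({"title": ["A", "B"], "ts": [1, 2]})
df2 = pd.DataFrame({"title": ["B", "C"], "ts": [9, 3]})

# Keep only the rows of df2 whose title is not already in df1, then concatenate
df = pd.concat([df1, df2[~df2["title"].isin(df1["title"])]]).reset_index(drop=True)

# Alternative: concatenate first, then keep the first row seen for each title
df_alt = (
    pd.concat([df1, df2])
    .drop_duplicates(subset="title", keep="first")
    .reset_index(drop=True)
)

print(df)
```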
This probably solves your problem:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df2=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah1','blah2','blah3','blah4','blah']
df2.columns=['blah5','blah6','blah7','blah8','blah']
# Drop every column of df2 whose name already exists in df
shared = [col for col in df2.columns if col in df.columns]
df2 = df2.drop(columns=shared)
print(pd.concat([df, df2], axis =1))
Here below is the CSV file that I'm working with:
I'm trying to get my hands on the enj coin: (United States) column. Nonetheless when I try printing all of the columns of the DataFrame it doesn't appear to be treated as a column
Code:
import pandas as pd
df = pd.read_csv("/multiTimeline.csv")
print(df.columns)
I get the following output:
Index(['Category: All categories'], dtype='object')
I've tried accessing the column with df['Category: All categories']['enj coin: (United States)'] but sadly it doesn't work.
Question:
Could someone explain how I could transform this DataFrame (which has only one column, Category: All categories) into a DataFrame with two columns, Time and enj coin: (United States)?
Thank you very much for your help
Try using the parameter skiprows=2 when reading in the CSV. I.e.
df = pd.read_csv("/multiTimeline.csv", skiprows=2)
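A self-contained sketch of why this works, using an in-memory string instead of the real file. The assumption here is that the file starts with the category line followed by a blank line, which is why two lines are skipped; the sample rows are made up.

```python
from io import StringIO

import pandas as pd

# Assumed layout: a title line, a blank line, then the real header and data
raw = """Category: All categories

Time,enj coin: (United States)
2019-04-10T19,7
2019-04-10T20,20
"""

# Skip the title line and the blank line so the third line becomes the header
df = pd.read_csv(StringIO(raw), skiprows=2)

print(df.columns)
```

If the file has no blank line after the title, skiprows=1 would be the value to use.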
The csv looks good.
Ignore the complex header at the top.
pd.read_csv(csvdata, header=[1])
The entire header can be taken in as well, although it is not delimited as the data is.
import pandas as pd
from io import StringIO
print(pd.__version__)
csvdata = StringIO("""Category: All categories
Time,enj coin: (United States)
2019-04-10T19,7
2019-04-10T20,20""")
df = pd.read_csv(csvdata, header=[0,1])
print(df)
0.24.2
Category: All categories
Time
2019-04-10T19 7
2019-04-10T20 20
How do I set my indexes from "Unnamed" to the first line of my dataframe in Python?
import pandas as pd
df = pd.read_excel('example.xls','Day_Report',index_col=None ,skipfooter=31)
df = df.dropna(how='all',axis=1)
df = df.dropna(how='all')
df = df.drop(2)
To set the column names (assuming that's what you mean by "indexes") to the first row, you can use
df.columns = df.loc[0, :].values
Following that, if you want to drop the first row, you can use
df.drop(0, inplace=True)
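Put together, the two steps might look like this on a made-up frame whose real header ended up in the first row (the data and the "Unnamed" column names are assumptions):

```python
import pandas as pd

# Made-up frame: the real header sits in row 0
df = pd.DataFrame(
    [["Name", "Score"], ["Ann", 1], ["Bob", 2]],
    columns=["Unnamed: 0", "Unnamed: 1"],
)

# Promote the first row to column names
df.columns = df.loc[0, :].values

# Remove that row and renumber the remaining ones
df = df.drop(0).reset_index(drop=True)

print(df)
```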
Edit
As coldspeed correctly notes below, if the source of this is reading a CSV, then adding the skiprows=1 parameter is much better.
I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print efficiency
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file, pass index=False and index_col=False, respectively. An example follows.
To write:
df.to_csv(filename, index=False)
and to read from the CSV:
df = pd.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
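A quick round trip illustrates the idea. This sketch writes to an in-memory buffer rather than a real file, and the column values are made up:

```python
from io import StringIO

import pandas as pd

# Made-up data standing in for the efficiency file
df = pd.DataFrame({"Energy": [1.0, 2.0], "Efficiency": [0.5, 0.7]})

# Write without the index so no extra unnamed column ends up in the CSV
buffer = StringIO()
df.to_csv(buffer, index=False)

# Read it back without promoting any column to the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=False)

print(restored)
```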
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index(). By default the old index is kept as a new column; pass drop=True to discard it.
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
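A minimal sketch of these points, on a made-up frame with Energy as the index (the values and variable names are assumptions):

```python
import pandas as pd

# Made-up stand-in for the efficiency data, with Energy as the index
df = pd.DataFrame(
    {"Efficiency": [0.5, 0.7]},
    index=pd.Index([10, 20], name="Energy"),
)

# The index is not a column, so there is nothing called 'index' to delete
has_index_column = "index" in df.columns

# The index is only shown alongside the Series; the bare values are available separately
efficiency_values = df["Efficiency"].tolist()

# Replace the index with 0, 1, ... and discard the old labels
renumbered = df.reset_index(drop=True)

print(has_index_column)
print(efficiency_values)
print(list(renumbered.index))
```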
You can set one of the columns as an index in case it is an "id" for example.
In this case the index column will be replaced by one of the columns you have chosen.
df.set_index('id', inplace=True)
If your problem is the same as mine, where you just want to reset the column headers to 0, 1, ... up to the number of columns, do
df = pd.DataFrame(df.values)
EDIT:
Not a good idea if you have heterogeneous data types. Better to just use
df.columns = range(len(df.columns))
You can specify which column is the index in your CSV file by using the index_col parameter of the from_csv function.
If this doesn't solve your problem, please provide an example of your data.
One thing that I do is df=df.reset_index()
then df=df.drop(['index'],axis=1)
To remove the default index column, or to avoid creating it in the first place, you can set index_col to False and keep header at zero. Here is an example of how you can do it.
recording = pd.read_excel("file.xls",
                          sheet_name="sheet1",
                          header=0,
                          index_col=False)
header=0 turns the first row into the column headers, which you can use later to refer to the columns.
It works for me this way:
Df = data.set_index("name of the column header to start as index column" )