I have the following code:
for label, content in data.items():
    print('label:', label)
    print('content:', content, sep='\n')
You can subset by index
data2 = data.loc[(data.index.month == 11) & (data.index.day == 10)]
Your index is of datetime type, and you want it converted to string. First we need to reset_index:
data2 = data2.reset_index()
data2["Date"] = data2["Date"].astype(str)
Then get the rows as a list of lists (your example output):
data2.values.tolist()
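Putting the steps above together, here is a minimal runnable sketch; the example data and the "Date" index name are made up for illustration:

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex named "Date"
data = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.to_datetime(["2021-11-10", "2021-11-10", "2021-12-01"]),
)
data.index.name = "Date"

# Subset by the index's month and day
data2 = data.loc[(data.index.month == 11) & (data.index.day == 10)]

# Move the index into a column and convert it to string
data2 = data2.reset_index()
data2["Date"] = data2["Date"].astype(str)

rows = data2.values.tolist()
print(rows)  # [['2021-11-10', 1], ['2021-11-10', 2]]
```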
You would need to split the date into three columns: "Year", "Month", "Day".
Then simply do a selection like this (note the parentheses: & binds more tightly than ==, so each comparison must be wrapped):
newdf = df[(df["Day"] == 20) & (df["Month"] == 11)]
Then you can perform your analysis on this new df.
To translate the code it means "copy all the values from df where Month is 11 and Day is 20"
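A small sketch of both steps, the split and the selection; the column names and sample dates are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frame with a datetime column named "date"
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-20", "2021-11-21", "2021-12-20"]),
    "value": [10, 20, 30],
})

# Split the date into three columns
df["Year"] = df["date"].dt.year
df["Month"] = df["date"].dt.month
df["Day"] = df["date"].dt.day

# Copy all the rows where Month is 11 and Day is 20
newdf = df[(df["Day"] == 20) & (df["Month"] == 11)]
print(newdf["value"].tolist())  # [10]
```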
I have a dataframe where date and datetime values were added to a column that was expected to contain strings. What would be the best way to filter all date and datetime values out of a pandas dataframe column and replace those values with blanks?
Thank you!
In general, if you provided a minimal working example of your problem, one could help more specifically, but assuming you have the following column:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=(10,1)), columns=["Mixed"])
df["Mixed"] = "foobar"
df.loc[2,"Mixed"] = pd.to_datetime("2022-08-22")
df.loc[7,"Mixed"] = pd.to_datetime("2022-08-21")
#print("Before Fix", df)
You can use apply(type) on the column to obtain the datatype of each cell, and then use a list comprehension, [x != str for x in types], to check for each cell whether it is a string or not. After that, just replace the values that are not of the desired datatype with a value of your choosing.
types = df["Mixed"].apply(type).values
mask = [x!=str for x in types]
df.loc[mask,"Mixed"] = "" #Or None, or whatever you want to overwrite it with
#print("After Fix", df)
Assume I have the following data frame:
I want to create two data frames such that, for any row, if the value in column Actual equals the value in column Predicted, the row goes into one data frame; otherwise it goes into the other.
For example, rows 0, 1, 2 go in a dataframe named correct_df and rows 245, 247 go in a dataframe named incorrect_df.
Use boolean indexing:
m = df['Actual'] == df['Predicted']
correct_df = df.loc[m]
incorrect_df = df.loc[~m]
You can use this:
df_cor = df.loc[(df['Actual'] == df['Predicted'])]
df_incor = df.loc[(df['Actual'] != df['Predicted'])]
And use reset_index if you want a new index.
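Both answers can be tried on a small frame; the Actual/Predicted values below are made up for illustration:

```python
import pandas as pd

# Hypothetical predictions: rows 0, 1, 3 match; row 2 does not
df = pd.DataFrame({"Actual": [1, 0, 1, 1], "Predicted": [1, 0, 0, 1]})

# Boolean mask: True where Actual equals Predicted
m = df["Actual"] == df["Predicted"]
correct_df = df.loc[m]
incorrect_df = df.loc[~m].reset_index(drop=True)  # reset_index for a fresh 0..n index

print(len(correct_df), len(incorrect_df))  # 3 1
```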
From the datetime object in the dataframe I created two new columns based on month and day.
data["time"] = pd.to_datetime(data["datetime"])
data['month']= data['time'].apply(lambda x: x.month)
data['day']= data['time'].apply(lambda x: x.day)
The resultant data had the correct month and day added to the specific columns.
Then I tried to filter it based on a specific day:
data = data[data['month']=='9']
data = data[data['day']=='2']
These values were visible in the dataframe before filtering.
This returns an empty dataframe. What did I do wrong?
Compare with the integers 9 and 2, without quotes:
data = data[(data['month']==9) & (data['day']==2)]
Or:
data = data[(data['time'].dt.month == 9) & (data['time'].dt.day == 2)]
Actually there is no csv file. I have parquet files, from which I need to extract data from three tables: the publication, section and alt section tables.
As you can see from the images, I need the following outputs
I have a dataframe like this as shown in the screenshot.
I need to get the following output as a dataframe
pub_number   stdkw1   stdkw2
----------------------------
1078143      T.       Art.
Like this: if there are 3 or more values for the same number, it should take all of them as stdkw1, stdkw2, stdkw3, etc.
Group the dataframe by pub_number, then iterate over the groups. Append each group's std_section_name values, together with the pub_number, to a result list. Build a dataframe from the result list, then assign the column names.
import pandas as pd

df = pd.DataFrame([[3333,1078143,'T.'],[3333,1078143,'ssss'],[3334,1078145,'Art'],[3334,1078145,'Art'],[3334,1078145,'Art'],[3334,1078145,'Art'],[3334,1078143,'team']],
                  columns=['section_id','pub_number','std_section_name'])

result = list()
for name, group in df.groupby(by='pub_number'):  # a bare column name, so name is the key itself, not a tuple
    if group.shape[0] < 3:
        continue
    result.append([name] + group['std_section_name'].tolist())

ref = pd.DataFrame(result)
ref.columns = ["pub_number"] + ["stdkw" + str(i) for i in range(1, ref.shape[1])]
print(ref)
I am coming from an R background and used to being able to retrieve the value from a dataframe by using syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe by the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning the value in the pandas dataframe I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
Reading this code takes a lot more concentration because the column name and row number have swapped order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?
If I understand what you mean correctly, as #sammywemmy mentioned you can use .loc and .iloc to get/change value in any row and column.
If the order of your dataframe rows may change, you should define an explicit index so that you can still retrieve every row (datapoint) by its index after reordering.
Like below:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]
It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
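A minimal sketch of both directions with .at, using the employee data from the R example above (translated to pandas for illustration):

```python
import pandas as pd

# Hypothetical counterpart of the R employ.data frame
employ_data = pd.DataFrame({
    "employee": ["John Doe", "Peter Gynn", "Jolie Hope"],
    "salary": [21000, 23400, 26800],
})

# Retrieve: .at[row_label, column_name]
name = employ_data.at[1, "employee"]
print(name)  # Peter Gynn

# Assign with the exact same row/column order
employ_data.at[1, "employee"] = "Some other name"
print(employ_data.at[1, "employee"])  # Some other name
```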