Python, pandas: print values with a count of 1-1000 from a CSV

I have the following code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data1 = pd.read_csv('11-01 412-605.csv', low_memory=False)
d412 = pd.DataFrame(data1, columns=['size', 'price', 'date'])
new_df = pd.value_counts(d412['size']).reset_index()
new_df.columns = ['size', 'frequency']
print(new_df)
export_csv = new_df.to_csv('empty.csv', index=False, header=True)
Which outputs:
(screenshot of the full value-counts table in the original post)
However, I want to print out only the values that have a count of 1-1000. How do I do that? Right now it prints out all the values.
I tried:
new_df = pd.value_counts(d412['size']<1000).reset_index()
But that does not work, as it counts how many rows are True or False for the condition size < 1000 instead of filtering on the frequency.

Try:
print(new_df.loc[new_df['frequency'] < 1000, :])
If I misunderstood the columns of the count, please substitute 'frequency' with 'size'.

Welcome to Stack Overflow!
As the reference for Series.value_counts makes clear, value_counts() itself does not allow filtering the values. You can filter the counts with DataFrame.loc in a later step, as others have mentioned too. So the following code shall work:
new_df = d412['size'].value_counts().reset_index()
new_df.columns = ['size', 'frequency']
print(new_df.loc[new_df['frequency'] <= 1000])
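If you want the closed range 1-1000 spelled out explicitly, Series.between reads well; a minimal sketch, assuming new_df from the snippet above:
# keep only the sizes whose count falls in the closed range [1, 1000]
print(new_df[new_df['frequency'].between(1, 1000)])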

Related

Invert selection in Pandas

I have two datasets. Below you can see the code and data:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt

data = {'type_sale': ['group_1', 'group_2', 'group_3', 'group_4', 'group_5',
                      'group_6', 'group_7', 'group_8', 'group_9', 'group_10'],
        'id': [70, 20, 24, 80, 20, 20, 60, 20, 20, 20]}
df1 = pd.DataFrame(data, columns=['type_sale', 'id'])

data = {'type_sale': ['group_1', 'group_2', 'group_3'],
        'id': [70, 20, 24]}
df2 = pd.DataFrame(data, columns=['type_sale', 'id'])
This code creates the two datasets. Now I want to create a new dataset df3 with the rows from df1 whose values in the column id do not appear in df2 (distinct values). The final result should look like the picture in the original post. I tried the following code, but it does not give the desired result:
df = pd.concat((df1, df2))
print(df.drop_duplicates('id'))
Can anybody help me solve this problem?
Try as follows:
Use Series.isin to check, for each value in df1['id'], whether it is contained in df2['id'].
Next, invert the resulting boolean pd.Series by using the unary operator ~ (tilde) and select from df1.
Finally, reset the index.
In a one-liner:
df3 = df1[~df1['id'].isin(df2['id'])].reset_index(drop=True)
print(df3)
  type_sale  id
0   group_4  80
1   group_7  60
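An explicit anti-join via merge with indicator=True gives the same result; a sketch reusing df1 and df2 from the question:
# left-join df1 against df2's ids and keep only the rows that found no match
merged = df1.merge(df2[['id']].drop_duplicates(), on='id', how='left', indicator=True)
df3 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge').reset_index(drop=True)
print(df3)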

Creating a new column in Pandas

Thank you in advance for taking the time to help me! (Code provided below; the data is linked in the original post.)
I am trying to average the first 3 columns and insert the result as a new column labeled 'Topsoil'. What is the best way to go about doing that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
df_selected_station = df_all_stations  # station-selection step omitted in the original post
df_selected_station.fillna(method='ffill', inplace=True)
df_selected_station_D = df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean = df_selected_station_D.groupby(by='Day').mean()
mean['Day'] = mean.index
# mean.head()
Try this:
mean['avg3col'] = mean[['5 cm', '10 cm', '15 cm']].mean(axis=1)
Or, spelled out over generic column names:
df['new column'] = (df['col1'] + df['col2'] + df['col3']) / 3
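One behavioural difference between those two snippets is worth noting: mean(axis=1) skips NaN values by default, while the hand-written sum propagates them. A small self-contained illustration with made-up values:
import pandas as pd
import numpy as np

demo = pd.DataFrame({'col1': [1.0, np.nan],
                     'col2': [2.0, 4.0],
                     'col3': [3.0, 5.0]})
print(demo.mean(axis=1))                                 # second row -> 4.5, NaN skipped
print((demo['col1'] + demo['col2'] + demo['col3']) / 3)  # second row -> NaN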
You could use the apply method in the following way:
mean['Topsoil'] = mean.apply(lambda row: np.mean(row[0:3]), axis=1)
You can read about the apply method at the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The logic is that you perform the same task along a specific axis multiple times.
Note: it is unwise to give a data structure the name of a function; in your case mean_df would be a better name than mean.
Use DataFrame.iloc to select the first 3 columns by position, then take the row-wise mean:
mean['Topsoil'] = mean.iloc[:, :3].mean(axis=1)
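If the position of the new column matters, DataFrame.insert places it explicitly instead of appending it on the right; a short sketch reusing the iloc selection above:
# compute the row-wise mean of the first 3 columns and make it the first column
topsoil = mean.iloc[:, :3].mean(axis=1)
mean.insert(0, 'Topsoil', topsoil)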

Pandas category group by category sorting

I need to be able to sort the result of Pandas' second groupby by category.
The first groupby creates a list from another column, and the second one is the groupby result I need. The problem is that the second groupby does not honour the original sorted categorical index of the DataFrame.
import pandas as pd
import numpy as np
import numpy.ma as ma
from pathlib import Path
from pandas.api.types import CategoricalDtype

fr = Path('../data/rules-1.xlsx')
df = pd.read_excel(fr, sheet_name='MS')

print('Before:')
display(df)

ms_cat = ['Parent-C', 'Parent-A', 'Parent-B']
df['ParentMS'] = df['ParentMS'].astype(CategoricalDtype(ms_cat, ordered=True))
df = df.reset_index()
df = df.set_index('ParentMS')
df = df.sort_index()

print('After:')
display(df)

df_g = df.groupby(['ParentMS', 'Milestone'])['Tasks'].apply(list)
df_g = df_g.groupby('ParentMS')
# Category sort is not honored after the second groupby()
for name, group in df_g:
    print(name, group)
This is the input file (screenshot): https://i.stack.imgur.com/KZnZD.png
Combining the two df_g lines did the trick for me. I cannot explain it, but it worked:
df_g = df.groupby(['ParentMS', 'Milestone'])['Tasks'].apply(list).groupby('ParentMS')
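A minimal, self-contained sketch of the same pattern on toy data (the column values are hypothetical), for anyone who wants to verify that the chained groupby iterates in category order:
import pandas as pd
from pandas.api.types import CategoricalDtype

toy = pd.DataFrame({'ParentMS':  ['Parent-A', 'Parent-C', 'Parent-A', 'Parent-B'],
                    'Milestone': ['M1', 'M1', 'M2', 'M1'],
                    'Tasks':     ['t1', 't2', 't3', 't4']})
toy['ParentMS'] = toy['ParentMS'].astype(
    CategoricalDtype(['Parent-C', 'Parent-A', 'Parent-B'], ordered=True))

# list per (ParentMS, Milestone), then regroup by ParentMS
df_g = toy.groupby(['ParentMS', 'Milestone'], observed=True)['Tasks'].apply(list)
for name, group in df_g.groupby('ParentMS', observed=True):
    print(name)   # prints Parent-C, Parent-A, Parent-B, in category order
    print(group)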

Need help to solve the Unnamed and to change it in dataframe in pandas

How do I set my column names from "Unnamed" to the first row of my DataFrame in Python?
import pandas as pd
df = pd.read_excel('example.xls', sheet_name='Day_Report', index_col=None, skipfooter=31)
df = df.dropna(how='all', axis=1)
df = df.dropna(how='all')
df = df.drop(2)
To set the column names (assuming that's what you mean by "indexes") to the first row, you can use
df.columns = df.loc[0, :].values
Following that, if you want to drop the first row, you can use
df.drop(0, inplace=True)
Edit
As coldspeed correctly notes below, if the source of this is a CSV, then adding the skiprows=1 parameter when reading is much better.
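For read_excel, the analogous shortcut is the header parameter, which tells pandas which row of the sheet holds the column names; a sketch using the file from the question:
# use the sheet's second row (index 1) as the header row
df = pd.read_excel('example.xls', sheet_name='Day_Report', header=1, skipfooter=31)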

Pandas dropping columns and rows from a dataframe that came from Excel

I am trying to drop some useless columns in a dataframe, but I am getting the error "too many indices for array".
Here is my code :
import pandas as pd

def answer_one():
    energy = pd.read_excel("Energy Indicators.xls")
    energy.drop(energy.index[0,1], axis=1)

answer_one()
Option 1
Your slicing syntax is wrong, and the slice should be taken from the columns, not the index:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(energy.columns[[0,1]], axis=1)
Option 2
I'd do it like this
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.iloc[:, 2:]
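Note that neither drop nor iloc modifies energy in place by default, so bind the result if you want to keep it:
energy = energy.drop(energy.columns[[0, 1]], axis=1)
# or
energy = energy.iloc[:, 2:]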
I think it's better to skip the unneeded columns when parsing/reading the Excel file:
energy = pd.read_excel("Energy Indicators.xls", usecols='C:ZZ')
If you're trying to drop columns, you need to change the syntax. You can refer to them by header or by index. Here is how you would refer to them by name:
import pandas as pd
energy = pd.read_excel("Energy Indicators.xls")
energy.drop(['first_column', 'second_column'], axis=1, inplace=True)
Another solution would be to exclude them in the first place:
energy = pd.read_excel("Energy Indicators.xls", usecols=range(2, 11))  # adjust the upper bound to the sheet's column count
This will help speed up the import as well.
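usecols also accepts a callable that is evaluated against each column name, which avoids hard-coding the sheet's width; the names below are hypothetical placeholders:
energy = pd.read_excel("Energy Indicators.xls",
                       usecols=lambda name: name not in ('first_column', 'second_column'))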
