Pandas Filtering Data - python

I have a dataset from Bing that contains state- and county-level information. I am trying to create two different datasets, one for the county level and one for the state level.
How do I create a state-level-only data frame? Here is a picture of what the dataset looks like:
The counties dataframe worked with this code:
import pandas as pd
df = pd.read_csv("COVID19-DATA-WITHSTATES.csv")
only_counties = df[df['AdminRegion2'].str.contains("", na = True)]
It didn't work for the state level with this code:
only_states = df[df['AdminRegion2' != ""]]
EDIT: This is the code that worked
only_states = usa_only[lambda x: ~pd.notnull(x['AdminRegion2']) & (usa_only["AdminRegion1"].str.contains("", na = False))]

You can filter it with a lambda expression:
only_states = df[lambda x: pd.isnull(x['AdminRegion2'])]
For the second question, the same approach works as well:
df[lambda x: x['date'] == "date"]

Here is the answer for only states that worked:
only_states = usa_only[lambda x: ~pd.notnull(x['AdminRegion2']) & (usa_only["AdminRegion1"].str.contains("", na = False))]
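For reference, the same split can be written more directly with isna()/notna() (a minimal sketch, assuming state-level rows have AdminRegion1 filled and AdminRegion2 empty, as above):
# state-level rows: no county value, but a state name is present
only_states = usa_only[usa_only['AdminRegion2'].isna() & usa_only['AdminRegion1'].notna()]
# county-level rows: a county value is present
only_counties = usa_only[usa_only['AdminRegion2'].notna()]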

Related

Creating a new dataset with conditions on the current data in Python [duplicate]

For example:
I have this code:
import pandas
df = pandas.read_csv('covid_19_data.csv')
This dataset has information about COVID-19 cases from all the countries in the world, and it has a column called countryterritoryCode, which is the country code of each country. Sample data from the dataset:
How do I create a new dataset where only the USA info appears
(where countryterritoryCode == USA)?
import pandas
df = pandas.read_csv('covid_19_data.csv')
new_df = df[df["country"] == "USA"]
or
new_df = df[df.country == "USA"]
Use df.groupby with get_group:
df = pandas.read_csv('covid_19_data.csv')
df_new = df.groupby('countryterritoryCode').get_group('USA')
Note that groupby on its own only builds the groups; get_group('USA') (or the boolean mask above) is what actually extracts the USA rows.

loop over columns in dataframes python

I want to loop over 2 columns in a specific dataframe and access the data by column name, but it gives me a TypeError on line 3:
i = 0
for name, value in df.iteritems():
    q1 = df[name].quantile(0.25)
    q3 = df[name].quantile(0.75)
    IQR = q3 - q1
    min = q1 - 1.5*IQR
    max = q3 + 1.5*IQR
    minout = df[df[name] < min]
    maxout = df[df[name] > max]
    new_df = df[(df[name] < max) & (df[name] > min)]
    i += 1
    if i == 2:
        break
It looks like you want to exclude outliers based on the 1.5*IQR rule. Here is a simpler solution:
Input dummy data:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col%s' % (i+1): np.random.normal(size=1000)
                   for i in range(4)})
Removing the outliers (keep data: Q1-1.5IQR < data < Q3+1.5IQR):
Q1 = df.iloc[:, :2].quantile(.25)
Q3 = df.iloc[:, :2].quantile(.75)
IQR = Q3-Q1
non_outliers = (df.iloc[:, :2] > Q1-1.5*IQR) & (df.iloc[:, :2] < Q3+1.5*IQR)
new_df = df[non_outliers.all(axis=1)]
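If you want to apply the same rule to every column rather than only the first two, the column slicing can simply be dropped (a sketch on the same dummy df):
Q1 = df.quantile(.25)
Q3 = df.quantile(.75)
IQR = Q3 - Q1
non_outliers = (df > Q1 - 1.5*IQR) & (df < Q3 + 1.5*IQR)
new_df = df[non_outliers.all(axis=1)]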
A TypeError can happen for a lot of reasons, so it would be better if you added part of the DataFrame to help pin down the issue.
You can also loop over the rows with the iterrows() function and access the columns you need by name:
import pandas as pd
df = pd.read_csv('filename.csv')
for _, content in df.iterrows():
    print(content['columnname'])  # add the name of the column you want to loop over
Refer to the following link for more information:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows
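Note that iterrows() walks the DataFrame row by row. If you actually want to loop over the columns themselves, as in the original code, items() yields each column name together with its data as a Series (a sketch, assuming numeric columns):
for name, col in df.items():
    q1 = col.quantile(0.25)  # per-column quantiles, as in the question
    q3 = col.quantile(0.75)
    print(name, q1, q3)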

Is there any method to replace specific data from a column without breaking its structure or splitting

Hi there, I am trying to figure out how to replace specific data in a CSV file. I have a file which is the base or location data of IDs:
https://store8.gofile.io/download/5b031959-e0b0-4dbf-aec6-264e0b87fd09/service%20block.xlsx (sheet 2 has the data).
The file in which I want to replace data using the ID is below:
https://store8.gofile.io/download/6e13a19a-bac8-4d16-8692-e4435eed2a08/Serp.csv
The highlighted part needs to be deleted after filling in the location.
import pandas as pd
df1= pd.read_excel("serp.xlsx", header=None)
df2= pd.read_excel("flocnam.xlsx", header=None)
df1 = df1[0].str.split(";", expand=True)
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
m = dict(zip(df2[1], df2[0]))
df1[4]= df1[4].replace(m)
print(df1)
df1.to_csv ("test.csv")
It worked, but not how I wanted:
https://store8.gofile.io/download/c0ae7e05-c0e2-4f43-9d13-da12ddf73a8d/test.csv
I am trying to replace it like this (desired output).
Thank you for being a supportive community ❤️
If I understand correctly, you simply need to specify the separator ;
>>> df.to_csv('test.csv', sep=';', index_label=False)
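To check the result, you can read the file back with the same separator (a sketch):
>>> pd.read_csv('test.csv', sep=';')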

How to split out tagged column from df and fill new column?

I have a column in a Python df like:
TAGS
{user_type:active}
{session_type:session1}
{user_type:inactive}
How can I efficiently turn this column into a separate column for each of the tags specified?
Desired:
TAGS |user_type|session_type
{user_type:active} |active |null
{session_type:session1}|null |session1
{user_type:inactive} |inactive |null
My attempt is only able to do this in a boolean sense (not what I want), and only if I specify the columns from the tags (which I don't know ahead of time):
mask = df['tags'].apply(lambda x: 'user_type' in x)
df['user_type'] = mask
There are better ways, but this follows from what you've got:
import numpy as np
df['user_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'user_type' in x else np.nan)
df['session_type'] = df['tags'].apply(lambda x: x.strip('{}').split(':')[1] if 'session_type' in x else np.nan)
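If the tag names are not known ahead of time, a more generic approach (a sketch, assuming each value in the tags column is a single {key:value} string) is to split the string into key and value and pivot on the key:
import pandas as pd
# split each "{key:value}" string into a key column and a value column
kv = df['tags'].str.strip('{}').str.split(':', n=1, expand=True)
kv.columns = ['key', 'value']
# pivot so every distinct key becomes its own column, aligned with the original rows
df = df.join(kv.pivot(columns='key', values='value'))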
You could use pandas.json_normalize() to expand the TAGS column once its values are dict objects, and check whether user_type is a key of each dict.
df2 = pd.json_normalize(df['TAGS'])
df2['user_type'] = df['TAGS'].apply(lambda x: x['user_type'] if 'user_type' in x else 'null')
This is what ended up working for me; I wanted to post a short working example using the json library that helped.
def js(row):
    if row:
        return json.loads(row)
    else:
        return {'': ''}
#This example includes if there was/wasn't a dataframe with other fields including tags
import json
import pandas as pd
df2 = df.copy()
#Make some dummy tags
df2['tags'] = ['{"user_type":"active","nonuser_type":"inactive"}']*len(df2['tags'])
df2['tags'] = df2['tags'].apply(js)
df_temp = pd.DataFrame(df2['tags'].values.tolist())
df3 = (pd.concat([df2.drop('tags', axis=1), df_temp], axis=1))
@Ynjxsjmh, your approach reminds me of something I had used in the past, but in this case I got the following error:
AttributeError: 'str' object has no attribute 'values'
@Bing Wang, I am a big fan of list comprehensions, but in this case I don't know the names of the columns beforehand.
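For reference, once the tag strings have been parsed into dicts (as with the js helper above), pd.json_normalize does the same expansion in one call (a sketch):
# list of dicts -> one column per key, NaN where a key is missing
# assumes df2 has a default RangeIndex so the concat lines up
df_temp = pd.json_normalize(df2['tags'].tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)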

numpy under a groupby not working

I have the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/test.csv')
df.drop(['SecurityID'],1,inplace=True)
Time = 1
trade_filter_size = 9
groupbytime = (str(Time) + "min")
df['dateTime_s'] = df['dateTime'].astype('datetime64[s]')
df['dateTime'] = pd.to_datetime(df['dateTime'])
df[str(Time)+"min"] = df['dateTime'].dt.floor(str(Time)+"min")
df['tradeBid'] = np.where(((df['tradePrice'] <= df['bid1']) & (df['isTrade']==1)), df['tradeVolume'], 0)
groups = df[df['isTrade'] == 1].groupby(groupbytime)
print("groups",groups.dtypes)
#THIS IS WORKING
df_grouped = (groups.agg({
'tradeBid': [('sum', np.sum),('downticks_number', lambda x: (x > 0).sum())],
}))
# creating a new data frame which is filtered
df2 = pd.DataFrame( df.loc[(df['isTrade'] == 1) & (df['tradeVolume']>=trade_filter_size)])
# recalculating all the bid/ask volume to be based on the filter size
df2['tradeBid'] = np.where(((df2['tradePrice'] <= df2['bid1']) & (df2['isTrade']==1)), df2['tradeVolume'], 0)
df2grouped = (df2.agg({
# here is the problem!!! NOT WORKING
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
}))
The same aggregation is used in both places: 'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())]. The first time it works fine, but when it is applied to the filtered data in the new DataFrame it raises an error:
ValueError: downticks_number is an unknown string function
When I use this code instead to get around the above:
'tradeBid': [('sum', np.sum), lambda x: (x > 0).sum()],
I get this error:
ValueError: cannot combine transform and aggregation operations
Any idea why I get different results for the same usage of code?
Since there were 2 conditions to match for the 2nd groupby, I solved this by moving the filter into the df: I created a new column that combines both filters and filtered on that.
After that there was no problem with the groupby.
The order of operations was the problem.
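For reference, the downticks_number error in the question usually goes away once the filtered frame is grouped before aggregating, mirroring the first (working) block (a sketch using the same column names):
groups2 = df2[df2['isTrade'] == 1].groupby(groupbytime)
df2grouped = groups2.agg({
    'tradeBid': [('sum', np.sum), ('downticks_number', lambda x: (x > 0).sum())],
})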
