So I am creating dummy data for a project, and I have a million rows of this table:
You can see the Sub-Reason column contains nothing but NaN values, because I'm generating this data. What I want is to fill it with a value based on the Reason column:
if the Reason is 'Maintenance', I want to put a random value from: ['Indoor Connection', 'Last Mile Connection']
if the Reason is 'New Connection', I want to put a random value from: ['Delayed Connection', 'Connection Request']
if the Reason is 'Billing', I want to put a random value from: ['Update Request', 'Change Personal Info']
if the Reason is 'Complaints', I want to put a random value from: ['Wire Cut', 'Bad Service']
So what I did is a very basic approach:
for i in range(len(cop2)):
    if cop2['Reason'].loc[i][0] == 'Maintenance':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason1))
    if cop2['Reason'].loc[i][0] == 'Connection':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason2))
    if cop2['Reason'].loc[i][0] == 'Billing':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason3))
    if cop2['Reason'].loc[i][0] == 'Complaints':
        cop2['Sub-Reason'].loc[i][0] = np.random.choice(list(subReason4))
It works fine, but it takes a very long time (about 50 minutes). How can I do this in a way that doesn't take so long but still works?
Did you try the apply method? It's probably faster:
df['Sub-Reason'] = df['Reason'].apply(
    lambda x: np.random.choice(list(subReason1)) if x == 'Maintenance'
    else (np.random.choice(list(subReason2)) if x == 'Connection'
    else (np.random.choice(list(subReason3)) if x == 'Billing'
    else np.random.choice(list(subReason4)))))
Edited, since the OP's column is actually a list and we need its first value:
df['Sub-Reason'] = df['Reason'].apply(
    lambda x: np.random.choice(list(subReason1)) if x[0] == 'Maintenance'
    else (np.random.choice(list(subReason2)) if x[0] == 'Connection'
    else (np.random.choice(list(subReason3)) if x[0] == 'Billing'
    else np.random.choice(list(subReason4)))))
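For a million rows you can also skip per-row Python work entirely with np.select, which is usually far faster than apply. A minimal sketch, assuming Reason holds plain strings (if each cell is a one-element list, pull the value out first with cop2['Reason'].str[0]); the pools dict below is just a stand-in for your subReason1..subReason4, and since your description says 'New Connection' while your loop checks 'Connection', use whichever literal actually appears in the column:

import numpy as np

# Stand-in mapping from each Reason literal to its pool of sub-reasons
pools = {
    'Maintenance': ['Indoor Connection', 'Last Mile Connection'],
    'New Connection': ['Delayed Connection', 'Connection Request'],
    'Billing': ['Update Request', 'Change Personal Info'],
    'Complaints': ['Wire Cut', 'Bad Service'],
}

n = len(cop2)
reasons = cop2['Reason']  # or cop2['Reason'].str[0] if the cells are lists
conditions = [reasons == reason for reason in pools]
# Draw a full column of random picks per reason; np.select chooses row-wise
choices = [np.random.choice(pool, size=n) for pool in pools.values()]
cop2['Sub-Reason'] = np.select(conditions, choices, default=None)

This draws all the random values up front as NumPy arrays, so the per-row branching happens in C rather than in a Python loop.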
I have a data frame in pandas; one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
An example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes that string 'P1Y4M1D' and returns an integer number.
I am wondering how it is possible to change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # remove the first row; iloc selects data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column to new parsed values; the final goal is to write a new function that returns the average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what could be a way to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You would need to modify only the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd

def parse_postal_code(postal_code):
    # Split the postal code and keep the leading letters
    letters = postal_code.split('_')[0]
    return letters

# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# It can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
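Applied to the question itself, parse could be reworked to take a single timespan string and return a number of days, reusing your own regex and day arithmetic; apply plus a groupby then gives the average per creation year. A sketch under those assumptions (parse_timespan is a hypothetical name, 'creation' is assumed to already be datetime as in your do_process_citation_data, and 'timespan' is assumed to still hold strings):

import re

def parse_timespan(value):
    # e.g. 'P1Y4M1D' -> 1*365 + 4*30 + 1 days; a leading '-' negates the result
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
    daystotal = (year * 365) + (month * 30) + day
    return -daystotal if is_negative else daystotal

my_ocan['timespan_days'] = my_ocan['timespan'].apply(parse_timespan)
# Average timespan (in days) of docs created in each year
avg_per_year = my_ocan.groupby(my_ocan['creation'].dt.year)['timespan_days'].mean()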
Kaggle dataset (working on): New York Airbnb
Created with raw-data code for a better explanation of the issue:
airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name","neighbourhood_group"]]
I would like to fill the null values of "host_name" based on the "neighbourhood_group" column entries, like:
if airbnb['host_name'].isnull():
    airbnb["neighbourhood_group"] == "Bronx"
    airbnb["host_name"] = "Vie"
elif:
    airbnb["neighbourhood_group"] == "Manhattan"
    airbnb["host_name"] = "Sonder (NYC)"
else:
    airbnb["host_name"] = "Michael"
(this is wrong, just to represent the output format I want)
I've tried using an if statement, but I couldn't apply it in the correct way. Could you please help me solve this?
Thanks
You could try this -
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
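If there are more neighbourhood groups to cover, the same .loc pattern can be driven by a dict (the mapping below just repeats the two examples above):

host_by_group = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
for group, name in host_by_group.items():
    airbnb.loc[airbnb['host_name'].isnull() & (airbnb['neighbourhood_group'] == group), 'host_name'] = name
airbnb.loc[airbnb['host_name'].isnull(), 'host_name'] = 'Michael'  # default for everything else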
Pandas has a special method to fill NA values:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You may create a dict with values for the "host_name" field, using "neighbourhood_group" values as keys, and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then using map function together with "host_dict" we get a Series with values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally we just impute in all other NA cells some default value, in our case "Michael".
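Put together on a tiny made-up frame, the whole pattern looks like this (note that you could also map the full "neighbourhood_group" column instead of pre-filtering to NA rows, since fillna only touches the NA positions anyway):

import pandas as pd

airbnb = pd.DataFrame({'neighbourhood_group': ['Bronx', 'Manhattan', 'Queens'],
                       'host_name': [None, None, None]})
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(airbnb['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna('Michael')
print(airbnb['host_name'].tolist())  # ['Vie', 'Sonder (NYC)', 'Michael']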
You can do it with:
import pandas as pd

ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})

def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None

ornek['samp1'] = ornek.apply(filter_by_col, axis=1)
I am calculating a value of net balance from a condition, and I want to store it in a new variable altogether. The new variable should store this calculated value as an integer or a float, not as an array.
I have tried the following code:
#variable = something if condition else something_else
mar_final_bal = x_start_bal + df2['credit_line'] + df2['Net_Balance'] if df2['month' == 'March-2016']
apr_final_bal = mar_final_bal + df2['credit_line'] + df2['Net_Balance'] if df2['month' == 'Apr-2016']
mar_final_bal and apr_final_bal are the two variables that I want to create using the conditions on the right side.
It is evident that you are new to using Pandas; the syntax looks more like pseudocode than pandas code. IIUC, this is what you meant:
mar_final_bal = x_start_bal+df2.loc[df2['month'] == 'March-2016', 'credit_line'].sum() + df2.loc[df2['month'] == 'March-2016', 'Net_Balance'].sum()
apr_final_bal = mar_final_bal+df2.loc[df2['month'] == 'Apr-2016', 'credit_line'].sum() + df2.loc[df2['month'] == 'Apr-2016', 'Net_Balance'].sum()
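The pattern here is boolean-mask selection: df2.loc[mask, column] keeps only the rows where the mask is True, and .sum() collapses them to a single scalar, which is why the result is a plain number rather than an array. A tiny illustration with made-up data:

import pandas as pd

df2 = pd.DataFrame({'month': ['March-2016', 'March-2016', 'Apr-2016'],
                    'credit_line': [100, 200, 300],
                    'Net_Balance': [10, 20, 30]})

march_credit = df2.loc[df2['month'] == 'March-2016', 'credit_line'].sum()
print(march_credit)  # 300, a scalar rather than a Series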
I have a dataframe in Python using pandas. It has 2 columns called 'dropoff_latitude' and 'pickup_latitude'. I want to make a function that will create a 3rd column based on these 2 variables (it runs them through an API).
So I wrote a function:
def dropoff_info(row):
    dropoff_latitude = row['dropoff_latitude']
    dropoff_longitude = row['dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    dropoffinfo = dropoff_results2["Block"]["FIPS"][2:11]
    return dropoffinfo
Then I would run it as:
df['newcolumn'] = dropoffinfo(df)
However, it doesn't work.
Upon troubleshooting I find that when I print dropoff_latitude it looks like this:
0 40.773345947265625
1 40.762149810791016
2 40.770393371582031
...
And so I think that the URL can't get generated. I want dropoff_latitude to look like this when printed:
40.773345947265625
40.762149810791016
40.770393371582031
...
And I don't know how to specify that I want just the actual content part.
When I tried
dropoff_latitude = row['dropoff_latitude'][1]
dropoff_longitude = row['dropoff_longitude'][1]
it just gave me the values from the 1st row, so that obviously didn't work.
Ideas please? I am very new to working with dataframes... Thank you!
Alex - with pandas we typically like to avoid loops, but in your particular case, the need to ping a remote server for data pretty much requires it. So I'd do something like the following:
import json
import requests

l = []
for i in df.index:
    dropoff_latitude = df.loc[i, 'dropoff_latitude']
    dropoff_longitude = df.loc[i, 'dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    l.append(dropoff_results2["Block"]["FIPS"][2:11])
df['new'] = l
The key here is the .loc[i, ...] bit that gives you the ability to go through each row one by one, and call out the associated column to create the variables to send to your API.
Regarding your question about a drain on your memory - that's a little above my pay-grade, but I really don't think you have any other options in this case (unless your API has some kind of batch request that allows you to pull a larger data set in one call).
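Since the dropoff_info from the question already takes a row, an equivalent spelling is df.apply with axis=1; it still loops under the hood, so performance is similar, but it reads more compactly, and reusing one requests.Session across the calls may speed things up a little. A sketch:

import requests

session = requests.Session()  # reuse one HTTP connection across many requests

def dropoff_info(row):
    url = ("http://data.fcc.gov/api/block/find?format=json"
           "&latitude=%s&longitude=%s&showall=true"
           % (row['dropoff_latitude'], row['dropoff_longitude']))
    return session.get(url).json()["Block"]["FIPS"][2:11]

df['newcolumn'] = df.apply(dropoff_info, axis=1)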
I am learning Python's Pandas library using Kaggle's Titanic tutorial. I am trying to create a function which will calculate the percentage of nulls in a column.
My attempt below appears to print the entire dataframe instead of the null values in the specified column:
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))

null_percentage_calculator(train, "Age")
My previous (and very first Stack Overflow) question was a similar problem, and it was explained to me that the .index method in pandas is undesirable and that I should try to use other methods like [ ] and .loc to refer to the column explicitly.
So I have tried this:
df_column_null = [df[nullcolumn]].isnull().sum()
I have also tried
df_column_null = df[nullcolumn]df[nullcolumn].isnull().sum()
I am struggling to understand this aspect of Pandas. My non-function method works fine:
Train_Age_Nulls = train["Age"].isnull().sum()
Train_Age_Nulls_percentage = (Train_Age_Nulls/traintotal)*100
Train_Age_Nulls_percentage_rounded = np.ceil(Train_Age_Nulls_percentage)
print("{} percent of Train's Age are NaN values".format(Train_Age_Nulls_percentage_rounded))
Could anyone let me know where I am going wrong?
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    # what is testtotal? it is never defined inside the function
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))
    # passing df to format() here is what prints the entire dataframe
I would do this with:
def null_percentage_calculator(df, nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])  # float() guards against Python 2 integer division
    # multiply by 100 if you want a percentage
    print("{} percent of column {} are null".format(pct * 100, nullcolumn))
Beware of Python 2 integer division, where 63/180 = 0: if you want a float out, you have to put a float in. (In Python 3, / always returns a float.)
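Usage is the same as the original attempt:

null_percentage_calculator(train, "Age")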