Apply a function to each row - python

I am trying to convert from UTC time to LocaleTime in my dataframe. I have a dictionary where I store the number of hours I need to shift for each country code. So, for example, if I have df['CountryCode'][0]='AU' and df['UTCTime'][0]=2016-08-12 08:01:00, I want to get df['LocaleTime'][0]=2016-08-12 19:01:00, which is
df['UTCTime'][0]+datetime.timedelta(hours=dateDic[df['CountryCode'][0]])
I have tried to do it with a for loop, but since I have more than 1 million rows it's not efficient. I have looked into the apply function, but I can't seem to get it to take inputs from two different columns.
Can anyone help me?

Without a more concrete example it's difficult, but try this:
pd.to_timedelta(df.CountryCode.map(dateDic), unit='h') + df.UTCTime
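
If it helps, here is a minimal self-contained sketch of that idea with made-up data; the dateDic offsets and the column names are taken from the question, the values are invented:
import pandas as pd

# Hypothetical offsets (hours to shift per country code) and sample rows
dateDic = {"AU": 11, "DE": 2}
df = pd.DataFrame({
    "CountryCode": ["AU", "DE"],
    "UTCTime": pd.to_datetime(["2016-08-12 08:01:00", "2016-08-12 08:01:00"]),
})

# Map each country code to its offset, turn it into a timedelta, and add it (vectorised, no loop)
df["LocaleTime"] = df["UTCTime"] + pd.to_timedelta(df["CountryCode"].map(dateDic), unit="h")
print(df)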

Related

Difference between using pandas drop_duplicates or value_counts on a whole frame or one column

I am a new Python user, just trying to finish my homework, but I am willing to dig deeper when I run into questions.
OK, the problem is from my professor's sample code for data cleaning. He uses drop_duplicates() and value_counts() to check the unique values of a frame; here is his code:
spyq['sym_root'].value_counts() #method1
spyq['date'].drop_duplicates() #method2
Here is the output:
SPY 7762857 #method1
0 20100506 #method2
I use spyq.shape to help you understand the spyq dataframe:
spyq.shape #output is (7762857, 9)
spyq is a dataframe containing the trading history for SPY (S&P 500) for one day, 05/06/2010.
OK, after seeing this, I wondered why he specifies a column ('date' or 'sym_root') rather than just using the whole frame with spyq.drop_duplicates() or spyq.value_counts(), so I gave it a try:
spyq.value_counts()
spyq.drop_duplicates()
The output of both is (6993487, 9).
The row count has decreased!
But according to the professor's sample code there should be no duplicated rows, because the count from method 1's output is exactly the same as the row count from spyq.shape!
I am so confused about why the output of the whole-dataframe spyq.drop_duplicates() is not the same as spyq['column'].drop_duplicates() when there are no repeated values!
I tried to use
spyq.loc[spyq.drop_duplicates()]
to see what had been dropped, but it raises an error.
Can anyone kindly help me? I know my question is kind of stupid, but I just want to figure it out and learn Python from the most fundamental parts, not just learn some code to finish the homework.
Thanks!
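
As an aside, a tiny made-up example (not the spyq data) may help show why these are different checks: value_counts() on one column counts how often each value appears, while drop_duplicates() on the whole frame only drops rows that are identical in every column:
import pandas as pd

# Made-up frame: the first two rows are identical across all columns
toy = pd.DataFrame({
    "sym_root": ["SPY", "SPY", "SPY", "SPY"],
    "date": [20100506, 20100506, 20100506, 20100506],
    "price": [113.2, 113.2, 113.5, 113.9],
})

print(toy["sym_root"].value_counts())   # SPY 4 -> occurrences of each value in ONE column
print(toy["date"].drop_duplicates())    # one row -> unique values of ONE column
print(toy.drop_duplicates().shape)      # (3, 3) -> one fully duplicated row was dropped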

How to use approx_count_distinct to count distinct combinations of two columns in a Spark DataFrame?

I have a Spark DataFrame (sdf) where each row shows an IP visiting a URL. I want to count distinct IP-URL pairs in this data frame and the most straightforward solution is sdf.groupBy("ip", "url").count(). However, since the data frame has billions of rows, precise counts can take quite a while. I'm not particularly familiar with PySpark -- I tried replacing .count() with .approx_count_distinct(), which was syntactically incorrect.
I searched "how to use .approx_count_distinct() with groupBy()" and found this answer. However, the solution suggested there (something along those lines: sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count"))) doesn't seem to give me the counts that I want. The method .approx_count_distinct() can't take two columns as arguments, so I can't write sdf.agg(F.approx_count_distinct(sdf.ip, sdf.url).alias("distinct_count")), either.
My question is, is there a way to get .approx_count_distinct() to work on multiple columns and count distinct combinations of these columns? If not, is there another function that can do just that and what's an example usage of it?
Thank you so much for your help in advance!
Group with expressions and alias as needed. Let's try:
df.groupBy("ip", "url").agg(
    expr("approx_count_distinct(ip)").alias("ip_count"),
    expr("approx_count_distinct(url)").alias("url_count"),
).show()
Your code sdf.groupby(["ip", "url"]).agg(F.approx_count_distinct(sdf.url).alias("distinct_count")) will give a value of 1 for every group, since you are counting distinct values of one of the grouping columns, url.
If you want to count distinct IP-URL pairs using the approx_count_distinct function, you can combine them into an array and then apply the function. It would be something like this:
sdf.selectExpr("approx_count_distinct(array(ip, url)) as distinct_count")
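
For reference, a minimal sketch of the same array trick written with the DataFrame API instead of selectExpr (the sample rows are made up; only the ip/url column names come from the question). If your Spark version refuses complex types here, concatenating the two columns into one string with concat_ws is another option:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [("1.1.1.1", "/a"), ("1.1.1.1", "/a"), ("1.1.1.1", "/b"), ("2.2.2.2", "/a")],
    ["ip", "url"],
)

# Pack each pair into an array, then approximate the number of distinct pairs
sdf.select(F.approx_count_distinct(F.array("ip", "url")).alias("distinct_count")).show()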

I want to subtract one column from another in pandas, but I keep getting a copy error. Is there a better way to do this operation?

I have a data frame TB_greater_2018 that has 3 columns: country, e_inc_100k_2000 and e_inc_100k_2018. I would like to subtract e_inc_100k_2000 from e_inc_100k_2018, use the returned values to create a new column of the differences, and then sort by the countries with the largest difference. My current code is:
case_increase_per_100k = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018["case_increase_per_100k"] = case_increase_per_100k
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
When I run this, I get a SettingWithCopyWarning. Is there a way to do this without getting the warning? Or, overall, a better way of accomplishing the task?
You can do
TB_greater_2018["case_increase_per_100k"] = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
It looks like the warning comes from computing the difference and assigning it as a column in separate operations, although to be honest I'm not clear why that would be.
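
For what it's worth, that warning usually appears because TB_greater_2018 is itself a slice of a larger DataFrame (an assumption here, since we don't see how it was built); a sketch of two common ways to sidestep it:
# Option 1: take an explicit copy once, so later column assignments are unambiguous
TB_greater_2018 = TB_greater_2018.copy()
TB_greater_2018["case_increase_per_100k"] = (
    TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
)

# Option 2: assign() returns a new frame and avoids chained assignment entirely
result = (
    TB_greater_2018
    .assign(case_increase_per_100k=TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"])
    .sort_values("case_increase_per_100k", ascending=False)
    .head()
)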

PySpark - converting hour and minute data to seconds

I have a given time of XXh:YYm (e.g. 1h:23m) that I'm trying to convert to seconds. The tricky part is that if it is less than an hour, the time is given as just YYm (e.g. 52m).
I am currently using
%pyspark
from pyspark.sql.functions import col, unix_timestamp, regexp_replace

newColumn = unix_timestamp(col("time"), "H:mm")
dataF.withColumn('time', regexp_replace('time', 'h|m', '')).withColumn("time", newColumn).show()
This works great for removing the h and m letters and then converting to seconds, but it returns null when the time is less than an hour, as explained above, since it's not actually in the H:mm format. What's a good approach to this? I keep trying different things that seem to overcomplicate it, and I still haven't found a solution.
I am leaning toward some sort of conditional like
if value contains 'h:' then newColumn = unix_timestamp(col("time"), "H:mm")
else newColumn = unix_timestamp(col("time"), "mm")
but I am fairly new to pyspark and not sure how to do this to get the final output. I am basically looking for an approach that will convert a time to seconds and can handle formats of '1h:23m' as well as '53m'.
This should do the trick, assuming the time column is a string type. It just uses when/otherwise to separate the two different formats (by checking whether the value contains 'h') and substring to get the desired minutes.
from pyspark.sql import functions as F
df.withColumn("seconds", F.when(F.col("time").contains("h"), F.unix_timestamp(F.regexp_replace("time", "h|m", ''),"H:mm"))\
.otherwise(F.unix_timestamp(F.substring("time",1,2),"mm")))\
.show()
+------+-------+
| time|seconds|
+------+-------+
|1h:23m| 4980|
| 23m| 1380|
+------+-------+
You can use the "unix_timestamp" function to convert a DateTime to a unix timestamp in seconds.
You can refer to one of my blog posts on Spark DateTime functions and go to the "unix_timestamp" section:
https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-datetime-functions-b66de737950a
Regards,
Neeraj
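
To illustrate that point, a small sketch (made-up data) of unix_timestamp parsing a timestamp string into epoch seconds:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-01-01 00:01:00",)], ["ts"])

# unix_timestamp parses the string with the given pattern and returns seconds since the epoch
# (the exact number depends on the session time zone)
df.select(F.unix_timestamp("ts", "yyyy-MM-dd HH:mm:ss").alias("epoch_seconds")).show()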

Conditionally add 1 to int element of a NumPy record array

I have a large NumPy record array, 250 million rows by 9 columns (MyLargeRec), and I need to add 1 to the 7th column (dtype = "int") if the index of that row is in another list of 300,000 integers (MyList). If this were a normal Python list I would use the following simple code...
for m in MyList:
MyLargeRec[m][6]+=1
However, I cannot seem to get similar functionality using the NumPy record array. I have tried a few options such as nditer, but this will not let me select the specific indices I want.
Now you may say that this is not what NumPy was designed for, so let me explain why I am using this format: it only takes 30 minutes to build the record array from scratch, whereas it takes over 24 hours using a conventional 2D list format. I spent all of yesterday trying to find a way to do this and could not; I eventually converted it to a list using...
MyLargeList = list(MyLargeRec)
so I could use the simple code above to achieve what I want; however, this took 8.5 hours to run.
Therefore, can anyone tell me, first, is there a method to achieve what I want within a NumPy record array? And second, if not, any ideas on the best methods within Python 2.7 to create, update and store such a large 2D matrix?
Many thanks
Tom
your_array[index_list, 6] += 1
NumPy allows you to construct some pretty neat slices. This selects column index 6 (the 7th column) of all rows in your list of indices and adds 1 to each. (Note that if an index appears multiple times in your list of indices, this will still only add 1 to the corresponding cell.)
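
If repeated indices in MyList should each add 1 (the buffered += only adds once per cell), numpy's unbuffered np.add.at handles that; and for a structured/record array you index the field first. A small sketch with a made-up structured array, where the field name 'f6' is only a stand-in for the 7th column of MyLargeRec:
import numpy as np

# Made-up structured array; 'f6' stands in for the 7th (int) column
rec = np.zeros(10, dtype=[("f0", "f8"), ("f6", "i8")])
my_list = [2, 5, 5, 7]          # index 5 appears twice and should receive +2

# Index the field first, then let np.add.at accumulate over repeated indices
np.add.at(rec["f6"], my_list, 1)
print(rec["f6"])                # [0 0 1 0 0 2 0 1 0 0]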
This code...
for m in MyList:
MyLargeRec[m][6]+=1
does actually work, silly question by me.
