Strange behaviour in pandas concat - python

I was just curious as to what's going on here. I have 13 dataframes that look something like this:
df1:
time val
00:00 1
00:01 2
00:02 5
00:03 8
df2:
time val
00:04 5
00:05 12
00:06 4
df3:
time val
00:07 8
00:08 24
00:09 3
and so on. As you can see, each dataframe continues the time exactly where the previous one left off, so ideally I would like them in one dataframe for simplicity's sake. Note that the example dataframes are significantly smaller than my actual ones. However, upon using the following:
df = pd.concat([pd.read_csv(i, usecols=[0,1,2]) for i in sample_files])
Where these 13 dataframes are produced through that list comprehension, I get a very strange result. It's as if I have set axis=1 inside the pd.concat() function. If I try to reference a column, say val
df['val']
Pandas returns something that looks like this:
0 1
1 2
...
2 5
3 8
Name: val, Length: 4, dtype: float64
In this output it does not specify what happened to the other 11 val columns. If I then reference an index, as follows:
df['val'][0]
It returns:
0 1
0 5
0 8
Name: val, dtype: float64
which is the first index of each column. I am unsure as to why pandas is behaving like this, as I would imagine it just joins together columns with similar header names, but obviously this isn't the case.
If someone could explain this that would be great.

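I believe the issue is that you are not resetting the index after concatenation and before selecting the data. pd.concat keeps each file's original row labels, so the label 0 appears once per file and df['val'][0] returns every row labelled 0.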
Try:
df = pd.concat([pd.read_csv(i, usecols=[0,1,2]) for i in sample_files])
df = df.reset_index(drop=True)
df['val'][0]
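To make the duplicate-label effect concrete, here is a minimal sketch. The two small frames are made up for illustration and stand in for the CSV files; passing ignore_index=True to pd.concat is an alternative to calling reset_index(drop=True) afterwards.

```python
import pandas as pd

# Two toy frames standing in for the CSV files (illustrative data only)
df1 = pd.DataFrame({"time": ["00:00", "00:01"], "val": [1.0, 2.0]})
df2 = pd.DataFrame({"time": ["00:02", "00:03"], "val": [5.0, 8.0]})

# Plain concat stacks the rows but keeps each frame's own labels
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())   # the labels repeat: 0, 1, 0, 1
print(stacked.loc[0, "val"])    # a Series: one row per frame labelled 0

# ignore_index=True builds a fresh 0..n-1 index during the concat
fixed = pd.concat([df1, df2], ignore_index=True)
print(fixed.loc[0, "val"])      # a single scalar
```

So nothing was joined column-wise; the rows are all there, and only the label-based lookup is ambiguous.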

Related

Biweekly pandas data with period label

I'm trying to create biweekly periods from a pandas data frame. For instance:
import numpy as np
import pandas as pd
date_range = pd.date_range("2022-04-01", "2022-04-30", freq="B")
test_data = pd.DataFrame(np.arange(len(date_range)), index=date_range)
I'd like to have a Period index with 2 weeks length. I have assumed that the pandas way to do it is the following
test_data.resample("2W", kind="period").last()
However the labels I'm getting are
0
2022-03-28/2022-04-03 5
2022-04-11/2022-04-17 15
2022-04-25/2022-05-01 20
I'd expect to see something like this
0
2022-03-21/2022-04-03 0
2022-04-04/2022-04-17 10
2022-04-18/2022-05-01 20
Another interesting point is that changing kind="timestamp" changes the values to the values I'd like to see at the end.
0
2022-04-03 0
2022-04-17 10
2022-05-01 20
Is there any native way to get biweekly index from pandas, or better to do it manually?
You can try pandas.Grouper
df = test_data.groupby(pd.Grouper(freq='2W')).last()
print(df)
0
2022-04-03 0
2022-04-17 10
2022-05-01 20
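If you also want the start/end style labels from the question, one possible sketch is to derive them from the bin-end timestamps that Grouper returns. This assumes the default right-labelled 2W bins ending on Sunday, so each bin starts 13 days before its label; the string labels are plain text, not a real PeriodIndex.

```python
import numpy as np
import pandas as pd

date_range = pd.date_range("2022-04-01", "2022-04-30", freq="B")
test_data = pd.DataFrame(np.arange(len(date_range)), index=date_range)

# Biweekly bins labelled by their last day (a Sunday)
biweekly = test_data.groupby(pd.Grouper(freq="2W")).last()

# Assumption: each bin covers the 14 days ending on its label
starts = biweekly.index - pd.Timedelta(days=13)
labels = [f"{s.date()}/{e.date()}" for s, e in zip(starts, biweekly.index)]

# Swap the timestamp index for the "start/end" strings
labelled = biweekly.set_axis(labels, axis=0)
print(labelled)
```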

Pandas: how to get the rows that has the maximum value_count on a column grouping by another column as a dataframe

I have three columns in a pandas dataframe, Date, Hour and Content. I want to get the hour in a day when there is the most content of that day. I am using messages.groupby(["Date", "Hour"]).Content.count().groupby(level=0).tail(1). I don't know what groupby(level=0) is doing here. It outputs as follows-
Date Hour
2018-04-12 23 4
2018-04-13 21 43
2018-04-14 9 1
2018-04-15 23 29
2018-04-16 17 1
..
2020-04-23 20 1
2020-04-24 22 1
2020-04-25 20 1
2020-04-26 23 32
2020-04-27 23 3
This is a pandas series object, and my desired Date and Hour columns are MultiIndex here. If I try to convert the MultiIndex object to dataframe using pd.DataFrame(most_active.index), most_active being the output of the previous code, it creates a dataframe of tuples as below-
0
0 (2018-04-12, 23)
1 (2018-04-13, 21)
2 (2018-04-14, 9)
3 (2018-04-15, 23)
4 (2018-04-16, 17)
.. ...
701 (2020-04-23, 20)
702 (2020-04-24, 22)
703 (2020-04-25, 20)
704 (2020-04-26, 23)
705 (2020-04-27, 23)
But I need two separate columns of Date and Hour. What is the best way for this?
Edit because I misunderstood your question
First, you have to count the total content by date-hour, just like you did:
df = messages.groupby(["Date", "Hour"], as_index=False).Content.count()
Here, I left the groups in their original columns by passing the parameter as_index=False.
Then, you can run the code below, provided in the original answer:
Supposing you have unique index IDs (if not, just do df.reset_index(inplace=True)), you can use idxmax method in groupby. It will return the index with the biggest value per group, then you can use them for slicing the dataframe.
For example:
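df.loc[df.groupby('Date')['Content'].idxmax()]
Note that the grouping here is by Date only: after the count above, every (Date, Hour) pair is already unique, so grouping by both columns would simply return every row.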
As an alternative (without using groupby), you can first sort the values in descending order, then remove the duplicate dates so that only the busiest hour of each day remains:
df.sort_values('Content', ascending=False).drop_duplicates(subset=['Date'])
Finally, you get a MultiIndex with the set_index() method:
df.set_index(['Date','Hour'])
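Putting the steps together, a minimal sketch on made-up data (the messages frame below is illustrative, not the asker's data). It also shows that the Series from the question can be turned into separate Date and Hour columns with reset_index instead of pd.DataFrame(most_active.index).

```python
import pandas as pd

# Illustrative stand-in for the asker's messages frame
messages = pd.DataFrame({
    "Date": ["2018-04-12"] * 3 + ["2018-04-13"] * 4,
    "Hour": [23, 23, 10, 21, 21, 21, 9],
    "Content": list("abcdefg"),
})

# Messages per (Date, Hour), with the keys kept as ordinary columns
counts = messages.groupby(["Date", "Hour"], as_index=False)["Content"].count()

# Row label of the largest count within each Date, then select those rows
busiest = counts.loc[counts.groupby("Date")["Content"].idxmax()]
busiest = busiest.reset_index(drop=True)
print(busiest)

# A MultiIndexed Series becomes a frame with Date and Hour columns like this
as_series = messages.groupby(["Date", "Hour"])["Content"].count()
as_frame = as_series.reset_index(name="Count")
```

Regarding the original one-liner: groupby(level=0) groups the counted Series by the first index level (Date), and tail(1) keeps the last row of each group, which is the latest hour of the day rather than the busiest one.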

Efficient way to fill missing dates by group in pandas?

So, I have a dataframe like this one:
date ID value
2018-01-01 A 10
2018-02-01 A 11
2018-04-01 A 13
2017-08-01 B 20
2017-10-01 B 21
2017-11-01 B 23
Each group can have very different dates, and there are about 400k groups. What I want to do is fill in the missing dates of each group efficiently, so it looks like this:
date ID value
2018-01-01 A 10
2018-02-01 A 11
2018-03-01 A nan
2018-04-01 A 13
2017-08-01 B 20
2017-09-01 B nan
2017-10-01 B 21
2017-11-01 B 23
I've tried two approaches:
df2 = df.groupby('ID').apply(lambda x: x.set_index('date').resample('D').pad())
And also:
df2= df.set_index(['date','ID']).unstack().stack(dropna=False).reset_index()
df2= df2.sort_values(by=['ID','date']).reset_index(drop=True)
df2= df2[df2.groupby('ID').value.ffill().notna()]
df2 = df2[df2.groupby('ID').value.bfill().notna()]
The first one is very slow because it uses apply. I guess I could use something else instead of pad so I get nan instead of the previous value, but I'm not sure that would improve the performance enough. I waited around 15 minutes and it didn't finish running.
The second one fills from the first date in the whole dataframe to the last one, for every group, which produces a massive dataframe. Afterward I drop all the leading and trailing nan generated by this method. This is considerably faster than the first option, but doesn't seem to be the best one. Is there a better way to do this that works well for big dataframes?
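One possible direction, sketched here under the assumption that every date is the first day of a month (as in the sample): compute each group's first and last date once, build only the dates inside that span, and left-merge the original values back. The names bounds, full and out are made up for this sketch, and it has not been timed on 400k groups.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-04-01",
                            "2017-08-01", "2017-10-01", "2017-11-01"]),
    "ID": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 11, 13, 20, 21, 23],
})

# First and last date of every group, computed in a single pass
bounds = df.groupby("ID")["date"].agg(start="min", end="max")

# Month-start dates between each group's own bounds (assumes monthly data)
full = pd.concat(
    [pd.DataFrame({"ID": gid, "date": pd.date_range(lo, hi, freq="MS")})
     for gid, lo, hi in bounds.itertuples()],
    ignore_index=True,
)

# Left merge keeps every generated date; missing months get NaN in value
out = full.merge(df, how="left", on=["ID", "date"])[["date", "ID", "value"]]
print(out)
```

Unlike the unstack/stack attempt, this never materialises dates outside a group's own range, so there are no leading or trailing nan rows to drop afterwards.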

How to link two dataframes based on the string similarity of one column

I have two dataframes, both have an ID and a Column Name that contains Strings. They might look like this:
Dataframes:
DF-1 DF-2
--------------------- ---------------------
ID Name ID Name
1 56 aaeessa 1 12 H.P paRt 1
2 98 1o7v9sM 2 76 aa3esza
3 175 HP. part 1 3 762 stakoverfl
4 2 stackover 4 2 lo7v9Sm
I would like to compute the string similarity (Ex: Jaccard, Levenshtein) between one element with all the others and select the one that has the highest score. Then match the two IDs so I can join the complete Dataframes later. The resulting table should look like this:
Result:
Result
-----------------
ID1 ID2
1 56 76
2 98 2
3 175 12
4 2 762
This could easily be achieved using a double for loop, but I'm looking for a more elegant (and faster) way to accomplish this, maybe lambdas, a list comprehension, or some pandas tool. Maybe some combination of groupby and idxmax for the similarity score, but I can't quite come up with the solution by myself.
EDIT: The DataFrames are of different lengths. One of the purposes of this function is to determine which elements of the smaller dataframe appear in the larger dataframe and match those, discarding the rest. So the resulting table should contain only pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).
Using the pandas dedupe package: https://pypi.org/project/pandas-dedupe/
You need to train the classifier with human input, and then it will use the learned settings to match the whole dataframe.
First pip install pandas-dedupe and try this:
import pandas as pd
import pandas_dedupe
df1=pd.DataFrame({'ID':[56,98,175],
'Name':['aaeessa', '1o7v9sM', 'HP. part 1']})
df2=pd.DataFrame({'ID':[12,76,762,2],
'Name':['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})
#initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])
# reset index
df_final = df_final.reset_index(drop=True)
# print result
print(df_final)
ID Name cluster id confidence
0 98 1o7v9sm 0.0 1.000000
1 2 lo7v9sm 0.0 1.000000
2 175 hp. part 1 1.0 0.999999
3 12 h.p part 1 1.0 0.999999
4 56 aaeessa 2.0 0.999967
5 76 aa3esza 2.0 0.999967
6 762 stakoverfl NaN NaN
You can see that matched pairs are assigned a cluster id and a confidence level, while unmatched rows are NaN. You can now analyse this info however you wish, for example by keeping only results with a confidence level above 80%.
I suggest a library called Python Record Linkage Toolkit (recordlinkage).
Once you import the library, you must index the sources you intend to compare, something like this:
import recordlinkage
indexer = recordlinkage.Index()
# block on the 'id' column: only records with the same id become candidate pairs
indexer.block('id')
candidate_links = indexer.index(df_1, df_2)
c = recordlinkage.Compare()
Let's say you want to compare based on the similarity of strings that don't match exactly:
c.string('name', 'name', method='jarowinkler', threshold=0.85)
And if you want an exact match you should use:
c.exact('name', 'name')
Using my fuzzy_wuzzy function from the linked answer:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
.merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
.filter(like='ID')
Output
ID1 ID2
0 56 76
1 98 2
2 175 12
3 2 762
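For reference, here is a self-contained sketch of the same best-match idea that needs no third-party package. It is not the fuzzy_merge function mentioned above: it swaps in difflib.SequenceMatcher from the standard library as the similarity score, and the helper names similarity and best_match plus the 0.6 threshold are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

df1 = pd.DataFrame({"ID": [56, 98, 175, 2],
                    "Name": ["aaeessa", "1o7v9sM", "HP. part 1", "stackover"]})
df2 = pd.DataFrame({"ID": [12, 76, 762, 2],
                    "Name": ["H.P paRt 1", "aa3esza", "stakoverfl", "lo7v9Sm"]})


def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_match(name, candidates, ids, threshold=0.6):
    """Return the id of the most similar candidate, or None below threshold."""
    scores = [similarity(name, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return ids[best] if scores[best] >= threshold else None


names2 = df2["Name"].tolist()
ids2 = df2["ID"].tolist()

result = pd.DataFrame({
    "ID1": df1["ID"],
    "ID2": [best_match(n, names2, ids2) for n in df1["Name"]],
})
print(result)
```

This still compares every name in df1 with every name in df2, so it is only reasonable for small frames; for larger ones, a blocking step like the one in recordlinkage is a way to cut down the candidate pairs.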

Iterate over dates in a Pandas Dataframe to get the count of a different column per week

I am a Java developer finding it a bit tricky to switch to Python and pandas. I'm trying to iterate over the dates of a pandas DataFrame which looks like this:
sender_user_id created
0 1 2016-12-19 07:36:07.816676
1 33 2016-12-19 07:56:07.816676
2 1 2016-12-19 08:14:07.816676
3 15 2016-12-19 08:34:07.816676
What I am trying to get is a dataframe which gives me a count of the total number of transactions that have occurred per week. From the forums I have been able to get syntax for 'for loops' which iterate over indexes only. Basically I need a result dataframe which looks like this. The value field contains the count of sender_user_id, and the date needs to be modified to show the starting date of each week.
date value
0 2016-12-09 20
1 2016-12-16 36
2 2016-12-23 56
3 2016-12-30 32
Thanks in advance for the help.
I think you need to resample by week and aggregate with size:
#cast to datetime if necessary
df.created = pd.to_datetime(df.created)
print (df.resample('W', on='created').size().reset_index(name='value'))
created value
0 2016-12-25 4
If you need a different week anchor, use another offset:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created').size().reset_index(name='value'))
created value
0 2016-12-23 4
If you need the number of unique values per week, aggregate with nunique:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created')['sender_user_id'].nunique()
.reset_index(name='value'))
created value
0 2016-12-23 3
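The outputs above are labelled with the last day of each week. Since the question asks for the starting date, one possible variation is to pass label='left' and closed='left' to resample, so that each bin runs from a Friday up to (but not including) the next Friday and is labelled with its first day. A sketch using the four sample rows, with the column renamed to date to match the desired output:

```python
import pandas as pd

df = pd.DataFrame({
    "sender_user_id": [1, 33, 1, 15],
    "created": pd.to_datetime([
        "2016-12-19 07:36:07.816676",
        "2016-12-19 07:56:07.816676",
        "2016-12-19 08:14:07.816676",
        "2016-12-19 08:34:07.816676",
    ]),
})

# Weekly bins anchored on Friday, closed and labelled on the left edge,
# so each row is tagged with the first day of its week
weekly = (
    df.resample("W-FRI", on="created", label="left", closed="left")
      .size()
      .reset_index(name="value")
      .rename(columns={"created": "date"})
)
print(weekly)
```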
