Merging and combining columns with duplicates with Pandas - python

I'm new to Python Pandas and haven't quite found what I need, so I'm hoping for some help. I am trying to format a file that looks something like this:
UserId,DomainId
TestTraderCAD,ALL
TestTraderCAD,CAD
TestTraderUSD,ALL
TestTraderUSD,USD
TestTraderGBP,ALL
TestTraderGBP,GBP
and produce a result that groups by UserId and outputs the following, where I also include a count of the number of domains for each user:
UserId,NumDomains,Domains
TestTraderCAD,2,ALL|CAD
TestTraderUSD,2,ALL|USD
TestTraderGBP,2,ALL|GBP
I've tried to get started by playing around with the groupby feature, but I'm not having much luck with it.
import pandas as pd
df = pd.read_csv('User_Domains.csv')
#print (df)
df2 = df.groupby(['UserId'],['DomainId']).sum()
print (df2)
Any help to get started would be appreciated.

Use agg:
>>> df.groupby('UserId').agg({'UserId': ['first', 'count'],
...                           'DomainId': '|'.join})
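If you want the exact output columns from the question (NumDomains and Domains), a named-aggregation sketch along these lines should also work; the output column names are taken from the desired result above:
import pandas as pd

df = pd.read_csv('User_Domains.csv')
# one row per user: count the domains and join them with '|'
out = (df.groupby('UserId', as_index=False)
         .agg(NumDomains=('DomainId', 'count'),
              Domains=('DomainId', '|'.join)))
print(out)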

Related

How to combine two dataframes into one like this, using pandas and python?

Please see the picture here.
I have two data frames and I need to combine them into a single one using merge or concat, but I am unable to do so. Can the community please help me with this?
import pandas as pd
df1 = pd.DataFrame.from_dict({'A':[1,2,2,3]})
df2 = pd.DataFrame.from_dict({'A':[1,2,3], 'B':['x', 'y', 'z']})
df1.merge(df2, how='left')
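Here how='left' keeps every row of df1 (including the duplicated 2) and fills in the matching B value from df2 for each of them, so the merged frame has four rows.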

Python Pandas number regex for part number

import pandas as pd
df = pd.read_csv('test.csv', dtype='unicode')
df.dropna(subset=["Description.1"], inplace = True)
df_filtered = df[(df['Part'].str.contains("-")==True) & (df['Part'].str.len()==8)]
I am trying to get pandas to filter the Part column so that it only shows numbers in this format: "###-####".
I cannot seem to figure out how to only show those. Any help would be greatly appreciated.
Right now it filters part numbers that contain a '-' and are 8 characters long. Even with this, I am still getting some that aren't in our internal format.
Can't seem to find anything similar to this online, and I am fairly new to Python.
Thanks
A small example:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""name,dig
aaa,750-2220
bbb,12-214
ccc,120
ddd,1020-10"""))
df.loc[df.dig.str.contains(r"\d{3}-\d{4}")]
which outputs
name dig
0 aaa 750-2220
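If the values have to match the ###-#### format exactly (nothing before or after it), a stricter variant is str.fullmatch, which should be available in pandas 1.0+; the column name follows the example above:
df.loc[df.dig.str.fullmatch(r"\d{3}-\d{4}")]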

How to create new rows based on columns while keeping the index constant?

I have a dataframe similar to this one:
And I would like to create this dataframe:
I tried to implement this using df.melt() and df.transpose() but I did not succeed. Does anyone have any tips for that? I tried some solutions I found here but I guess this problem is slightly different from them.
You can use pd.wide_to_long():
df = pd.wide_to_long(df,
                     stubnames='month',
                     i=['id', 'Name', 'City'],
                     j='month_num',
                     sep='_').reset_index().rename(columns={'month': 'month_value',
                                                            'month_num': 'month'})
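Since the original frames are only shown as pictures, here is a minimal sketch on made-up wide data (the id, Name, City and month_1/month_2 column names are assumptions) to show what the reshape produces:
import pandas as pd

wide = pd.DataFrame({'id': [1, 2],
                     'Name': ['Ana', 'Bob'],
                     'City': ['Rio', 'NYC'],
                     'month_1': [10, 30],   # hypothetical monthly values
                     'month_2': [20, 40]})

long_df = (pd.wide_to_long(wide,
                           stubnames='month',
                           i=['id', 'Name', 'City'],
                           j='month_num',
                           sep='_')
             .reset_index()
             .rename(columns={'month': 'month_value', 'month_num': 'month'}))
print(long_df)  # one row per (id, month) combination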

Pandas .stack() issue

I have been trying to use pandas to do a simple stack and it seems I am missing something.
I have a csv file in this format
I thought I would use stack to get this
The number of columns and number of items will vary
df = pd.read_csv("z-textsource.csv")
data_stacked = df.stack()
data_stacked.to_csv("z-textsource_stacked.csv")
However, when I run the code I get this
Many thanks in advance!
The item column is not the index at the moment. Please try:
df = pd.read_csv("z-textsource.csv", index_col=0)
and then run the same code you used.
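A minimal sketch with made-up data (the real csv is only shown as an image, so the item/month columns here are assumptions):
import pandas as pd
from io import StringIO

# stand-in for z-textsource.csv
csv = StringIO("""item,jan,feb
apples,10,12
pears,7,9""")

df = pd.read_csv(csv, index_col=0)  # 'item' becomes the index
data_stacked = df.stack()           # one row per (item, column) pair
print(data_stacked)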

How to get all groups from Dask DataFrameGroupBy, if I have more than one group-by field?

How can I get all unique groups in Dask from a grouped data frame?
Let's say, we have the following code:
g = df.groupby(['Year', 'Month', 'Day'])
I have to iterate through all groups and process the data within the groups.
My idea was to get all unique value combinations and then iterate through the collection and call e.g.
g.get_group((2018, 1, 12)).compute()
for each of them... which is not going to be fast, but hopefully will work.
In Spark/Scala I can achieve something like this using the following approach:
val res = myDataFrame.groupByKey(x => groupFunctionWithX(x)).mapGroups((key, iter) => {
  // process group with all the child records
})
I am wondering, what is the best way to implement something like this using Dask/Python?
Any assistance would be greatly appreciated!
Best, Michael
UPDATE
I have tried the following in python with pandas:
df = pd.read_parquet(path, engine='pyarrow')
g = df.groupby(('Year', 'Month', 'Day'))
g.apply(lambda x: print(x.Year[0], x.Month[0], x.Day[0], x.count()[0]))
And this was working perfectly fine. Afterwards, I have tried the same with Dask:
df2 = dd.read_parquet(path, engine='pyarrow')
g2 = df2.groupby(('Year', 'Month', 'Day'))
g2.apply(lambda x: print(x.Year[0], x.Month[0], x.Day[0], x.count()[0]))
This has led me to the following error:
ValueError: Metadata inference failed in `groupby.apply(lambda)`.
Any ideas what went wrong?
Computing one group at a time is likely to be slow. Instead, I recommend using groupby-apply:
df.groupby([...]).apply(func)
As in Pandas, the user-defined function func should expect a Pandas dataframe that has all rows corresponding to that group, and should return either a Pandas dataframe, a Pandas Series, or a scalar.
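To get past the metadata-inference error from the update, you can pass meta explicitly. A minimal sketch, assuming the Year/Month/Day columns and parquet path from the question, with an illustrative per-group row count:
import pandas as pd
import dask.dataframe as dd

def summarise(group: pd.DataFrame) -> pd.Series:
    # group is a plain Pandas DataFrame holding every row of one (Year, Month, Day) group
    return pd.Series({'rows': len(group)})

df2 = dd.read_parquet(path, engine='pyarrow')  # path as in the question
result = (df2.groupby(['Year', 'Month', 'Day'])
              .apply(summarise, meta=pd.DataFrame({'rows': pd.Series(dtype='int64')}))
              .compute())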
Getting one group at a time can be cheap if your data is indexed by the grouping column
df = df.set_index('date')
part = df.loc['2018-05-01'].compute()
Given that you're grouping by several columns, though, I'm not sure how well this will work.
