I have two pandas dataframes
Unnamed: 0 sentiment numberagreed tweetid tweet
0 0 2 6 219584 Apple processa a Samsung no Japão - Notícias -...
1 1 1 3 399249 É O JACKI CHAN !!! RT #user ESSE É DOS MEUS!!!...
2 2 3 3 387155 Eras o samsung galaxy tab e muito lerdo para t...
3 3 3 3 205458 Dizem que a coisa mais triste que o homem enfr...
4 4 3 3 2054404 RAIVA vou ter que ir com meu nike dinovo pra e...
tweetid sent
219584 0.494428
399249 0.789241
387155 0.351972
205458 0.396907
2054404 0.000000
They are not the same length, and the second data frame has some missing values.
I want to merge the two data frames on tweetid and drop the missing values.
Use pd.merge:
pd.merge(left=df1, right=df2, on='tweetid', how='inner')
Because this is an inner join, rows whose tweetid does not appear in both frames are thrown away; on='tweetid' tells pandas which column to match on.
There is probably an extra character somewhere at the beginning of your file. Are you reading the data from a csv file? Post the source code of how you are reading the data.
Or name the columns explicitly on both dataframes:
df_tweets.columns = ("tweetid", "sent")
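A minimal sketch of the whole pipeline, assuming df1 holds the tweets and df2 the sentiment scores (the tweet texts are shortened here for readability):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'tweetid': [219584, 399249, 387155, 205458, 2054404],
    'tweet': ['Apple processa...', 'É O JACKI CHAN...', 'Eras o samsung...',
              'Dizem que...', 'RAIVA vou ter...'],
})
df2 = pd.DataFrame({
    'tweetid': [219584, 399249, 387155, 205458],
    'sent': [0.494428, np.nan, 0.351972, 0.396907],
})

# inner join: keeps only tweetids present in both frames
merged = pd.merge(left=df1, right=df2, on='tweetid', how='inner')

# drop rows where the sentiment score itself is missing
merged = merged.dropna(subset=['sent'])
print(merged)
```

The inner join handles tweetids missing from one frame; dropna handles NaN scores inside df2.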
Related
I have a dataframe with a column called details that holds data like this:
130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -
To get the first value, 130, I did this:
df['superficie'] = df['details'].str.split('m²').str[0]
It gives me 130 in a new column called "superficie".
For the second value I did this:
df['nbPieces'] = df['details'].str.split('-').str[1].str.split('Pièces').str[0]
It gives me 3 in a new column called "nbPieces".
My problem is getting the 2 of the "Chambres", the 2 of the "Salles de bains", and the 20-30 next to "ans". How can I do that? I need to add them to new columns (nbChambre, nbSalleDeBain, NbAnnee).
Thanks in advance.
I suggest using regular expressions in pandas for this kind of operation:
import pandas as pd
df = pd.DataFrame()
df['details'] = ["130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"]
df['nb_chbr'] = df['details'].str.split(" - ").str[2].str.findall(r'\d+').str[0].astype('int64')
df['nb_sdb'] = df['details'].str.split(" - ").str[3].str.findall(r'\d+').str[0].astype('int64')
df['nb_annee'] = df['details'].str.split(" - ").str[5].str.findall(r'\d+').str[0].astype('int64')
print(df)
Output:
details nb_chbr nb_sdb nb_annee
0 130 m² - 3 Pièces - 2 Chambres - 2 Salles de b... 2 2 20
Note that I used " - " as the split string, which yields a cleaner list in your case. For the "Nombre d'années" field I simply took the first integer that appears in that piece; I don't know if that suits you.
Finally, there may be a problem in your data: 2 chambres and 2 salles de bains should make a 4 pièces flat ^^
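As an alternative sketch (not the answer above), you could pull all four numbers in one pass with str.extract and named groups. This assumes every row follows the same " - " layout; rows that don't match the pattern simply get NaN:

```python
import pandas as pd

df = pd.DataFrame({'details': [
    "130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"
]})

# one named group per target column; the lazy .*? before NbAnnee grabs the
# first integer after "Salles de bains" (the 20 of "20-30 ans")
pattern = (r'(?P<superficie>\d+)\s*m²\s*-\s*'
           r'(?P<nbPieces>\d+)\s*Pièces\s*-\s*'
           r'(?P<nbChambre>\d+)\s*Chambres\s*-\s*'
           r'(?P<nbSalleDeBain>\d+)\s*Salles de bains'
           r'.*?(?P<NbAnnee>\d+)')

extracted = df['details'].str.extract(pattern).apply(pd.to_numeric)
df = df.join(extracted)
print(df)
```

One str.extract call replaces the chain of str.split indexing, and a malformed row fails softly (NaN) instead of raising.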
Using Pandas 1.4.2, Python 3.9.12.
I have a data frame as follows:
Neighbourhood No-show
0 JARDIM DA PENHA No
1 JARDIM DA PENHA Yes
2 MATA DA PRAIA No
3 PONTAL DE CAMBURI No
4 JARDIM DA PENHA No
5 MARIA ORTIZ Yes
6 MARIA ORTIZ Yes
7 MATA DA PRAIA Yes
8 PONTAL DE CAMBURI No
9 MARIA ORTIZ No
How would I use groupby to get the count of 'Yes' and the count of 'No' grouped by each 'Neighbourhood'? I keep getting 'NoNoYesNo' if I use .sum() (it concatenates the strings). If I can get these grouped correctly by Neighbourhood, I think I can graph them much more easily.
This data frame is truncated as there are numerous other columns but these are the only 2 I care about for this exercise.
Use df.groupby() as follows:
totals = df.groupby(['Neighbourhood','No-show'])['No-show'].count()
print(totals)
Neighbourhood No-show
JARDIM DA PENHA No 2
Yes 1
MARIA ORTIZ No 1
Yes 2
MATA DA PRAIA No 1
Yes 1
PONTAL DE CAMBURI No 2
Name: No-show, dtype: int64
Good point raised by @JonClements: you might want to add .unstack(fill_value=0) to that, so:
totals_unstacked = df.groupby(['Neighbourhood','No-show'])['No-show'].count().unstack(fill_value=0)
print(totals_unstacked)
No-show No Yes
Neighbourhood
JARDIM DA PENHA 2 1
MARIA ORTIZ 1 2
MATA DA PRAIA 1 1
PONTAL DE CAMBURI 2 0
You can use:
df[['Neighbourhood', 'No-show']].value_counts()
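Another option: pd.crosstab builds the same unstacked count table in one call. A minimal sketch using the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'Neighbourhood': ['JARDIM DA PENHA', 'JARDIM DA PENHA', 'MATA DA PRAIA',
                      'PONTAL DE CAMBURI', 'JARDIM DA PENHA', 'MARIA ORTIZ',
                      'MARIA ORTIZ', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
                      'MARIA ORTIZ'],
    'No-show': ['No', 'Yes', 'No', 'No', 'No',
                'Yes', 'Yes', 'Yes', 'No', 'No'],
})

# rows: Neighbourhood, columns: No-show values, cells: counts
# combinations that never occur (e.g. PONTAL DE CAMBURI / Yes) show as 0
table = pd.crosstab(df['Neighbourhood'], df['No-show'])
print(table)
```

This gives the same wide table as groupby + count + unstack(fill_value=0), and the wide shape plots directly with table.plot.bar().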
In a dataframe I have two columns with information about when some football players made their debut. The columns are called 'Debut' and 'Debut Deportivo'. I have to create a function that builds a new column with the YYYY year from both columns, keeping NaN where both are missing. Let me show an example:
With the code I have written so far, I am able to get the value from one column and put it in the new one, but I have never managed to combine both.
The result should be something like this:
Debut                 Debut Deportivo                   fecha_debut
27 de mayo de 2006    2006(UD Vecindario)               2006
21 de agosto de 2010  11 de agosto de 2010(Portuguesa)  2010
21 de agosto de 2010  NaN                               2010
NaN                   NaN                               NaN
Can you help me get this code right, please?
df_4['Debut deportivo'].fillna('0000', inplace=True)
df_4['Debut'].fillna('0000', inplace=True)

def find_year(x):
    año = re.search(r'\d{4}', x)
    return int(año.group(0)) if año else 0

df_4['fecha_debut'] = df_4['Debut'].map(find_year)
df_4['fecha_debut'] = df_4['Debut deportivo'].apply(lambda x: np.nan if x.find('2') == -1 else x[x.find('0')-1:x.find('(')])
df_4['club_debut'] = df_4['Debut deportivo'].apply(lambda x: np.nan if x.find('(') == -1 else x[x.find('(')+1:x.find(')')])
df_4['fecha_debut'] = df_4['fecha_debut'].replace(0, np.nan)
# Do not modify the following lines
assert(isinstance(df, pd.DataFrame))
return df
I suggest you use str.extract + combine_first
df['fecha_debut'] = df['Debut'].str.extract(r'(\d{4})').combine_first(df['Debut Deportivo'].str.extract(r'(\d{4})'))
print(df)
Output
Debut Debut Deportivo fecha_debut
0 27 de mayo de 2006 2006(UD Vecindario) 2006
1 21 de agosto de 2010 11 de agosto de 2010(Portuguesa) 2010
2 21 de agosto de 2010 NaN 2010
3 NaN NaN NaN
For more on how to work with strings in pandas see this.
UPDATE
If you need the column to be numeric you could do:
df['fecha_debut'] = pd.to_numeric(df['fecha_debut']).astype(pd.Int32Dtype())
Note that because you have missing values in the column it cannot be of type int32. It can be either nullable integer or float. For more on working with missing data see this.
I have the following dataframe and would like to create a column at the end called "dup" showing the number of times the row shows up based on the "Seasons" and "Actor" columns. Ideally the dup column would look like this:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 1
This should do what you need:
df['dup'] = df.groupby(['Seasons', 'Actor']).cumcount() + 1
Output:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 2
As Scott Boston mentioned, according to your criteria the last row should also be 2 in the dup column.
Here is a similar post that can provide you more information. SQL-like window functions in PANDAS
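To sketch the window-function analogy: cumcount gives the running occurrence number (like SQL ROW_NUMBER), while transform('size') gives the total count per group (like COUNT(*) OVER), repeated on every row. The extra total column here is illustrative, not part of the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Stranger Things', 'Game of Thrones', 'La Casa De Papel',
             'Westworld', 'Stranger Things', 'La Casa De Papel'],
    'Seasons': [3, 8, 4, 3, 3, 4],
    'Actor': ['Millie', 'Emilia', 'Sergio', 'Evan Rachel', 'Millie', 'Sergio'],
})

# running occurrence number within each (Seasons, Actor) group: 1, 2, ...
df['dup'] = df.groupby(['Seasons', 'Actor']).cumcount() + 1

# total occurrences of each (Seasons, Actor) pair, broadcast to every row
df['total'] = df.groupby(['Seasons', 'Actor'])['Actor'].transform('size')
print(df)
```

With this, both Stranger Things rows and both La Casa De Papel rows get total 2, while dup still counts them 1, 2 in order of appearance.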
I have a question that I think is more about logic than about coding. My goal is to calculate for how many kilometers a truck travels loaded and charging for the cargo.
I have two Dataframes
Let's call the first one trips:
Date Licence City State KM
01/05/2019 AAA-1111 Sao Paulo SP 10
02/05/2019 AAA-1111 Santos SP 10
03/05/2019 AAA-1111 Rio de Janeiro RJ 20
04/05/2019 AAA-1111 Sao Paulo SP 15
01/05/2019 AAA-2222 Curitiba PR 20
02/05/2019 AAA-2222 Sao Paulo SP 25
Let's call the second one invoice:
Code Date License Origin State Destiny UF Value
A1 01/05/2019 AAA-1111 Sao Paulo SP Rio de Janeiro RJ 10.000,00
A2 01/05/2019 AAA-2222 Curitiba PR Sao Paulo SP 15.000,00
What I need to get is:
Date Licence City State KM Code
01/05/2019 AAA-1111 Sao Paulo SP 10 A1
02/05/2019 AAA-1111 Santos SP 10 A1
03/05/2019 AAA-1111 Rio de Janeiro RJ 20 A1
04/05/2019 AAA-1111 Sao Paulo SP 15 NaN
01/05/2019 AAA-2222 Curitiba PR 20 A2
02/05/2019 AAA-2222 Sao Paulo SP 25 A2
As I said, it's more a question of logic. The truck got its cargo at the initial point, which is São Paulo. How can I iterate through the rows, knowing that it passed through Santos loaded and then went to Rio de Janeiro, if I don't have the date when the cargo was delivered?
Thanks.
Assuming the rows in the first dataframe (df1) are sorted, here is what I would do:
Note: below I am using df1 for trips and df2 for invoice
Left-join df1 (left) and df2 (right) using as much information as is valid for matching the two dataframes, so that we can find the rows in df1 that are the Origin of each trip. In my test I match on the fields ['Date', 'License', 'City', 'State'] and save the result in a new dataframe df3:
df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin':'City'})
, on = ['Date', 'License', 'City', 'State']
, how = 'left'
)
Fill the NULL values in df3.Destiny with ffill():
df3['Destiny'] = df3.Destiny.ffill()
Set up the group labels with the following flag:
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
Note: you can assign df3['g'] = g to inspect the group labels.
Update df3.Code with ffill() within each of the above groups:
df3['Code'] = df3.groupby(g).Code.ffill()
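Putting the four steps together into one runnable sketch, using the sample data from the question (note the question spells the plate column 'Licence' in trips and 'License' in invoice; I assume 'License' for both here):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['01/05/2019', '02/05/2019', '03/05/2019', '04/05/2019',
             '01/05/2019', '02/05/2019'],
    'License': ['AAA-1111'] * 4 + ['AAA-2222'] * 2,
    'City': ['Sao Paulo', 'Santos', 'Rio de Janeiro', 'Sao Paulo',
             'Curitiba', 'Sao Paulo'],
    'State': ['SP', 'SP', 'RJ', 'SP', 'PR', 'SP'],
    'KM': [10, 10, 20, 15, 20, 25],
})
df2 = pd.DataFrame({
    'Code': ['A1', 'A2'],
    'Date': ['01/05/2019', '01/05/2019'],
    'License': ['AAA-1111', 'AAA-2222'],
    'Origin': ['Sao Paulo', 'Curitiba'],
    'State': ['SP', 'PR'],
    'Destiny': ['Rio de Janeiro', 'Sao Paulo'],
})

# step 1: left-join so the trip rows that start an invoice pick up Code/Destiny
df3 = df1.merge(df2.rename(columns={'Origin': 'City'}),
                on=['Date', 'License', 'City', 'State'], how='left')

# step 2: carry the destination forward over the following trip rows
df3['Destiny'] = df3['Destiny'].ffill()

# step 3: a new group starts at each invoice row, or on the row right after
# the destination city was reached
g = (~df3['Code'].isnull() | (df3['City'].shift() == df3['Destiny'])).cumsum()

# step 4: forward-fill the invoice code within each group
df3['Code'] = df3.groupby(g)['Code'].ffill()
print(df3[['Date', 'License', 'City', 'State', 'KM', 'Code']])
```

On the sample data this reproduces the desired output: the first three AAA-1111 legs get A1, the return leg on 04/05 stays NaN, and both AAA-2222 legs get A2.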