I have the following data frames:
print(df_a)
mukey DI PI
0 100000 35 14
1 1000005 44 14
2 1000006 44 14
3 1000007 43 13
4 1000008 43 13
print(df_b)
mukey niccdcd
0 190236 4
1 190237 6
2 190238 7
3 190239 4
4 190240 7
When I try to join these data frames:
join_df = df_a.join(df_b, on='mukey', how='left')
I get the error:
*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')
Why is this so? The data frames do have common 'mukey' values.
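The error on the snippet of data you posted is a little cryptic. join matches the 'mukey' column of df_a against the index of df_b, so df_b's own 'mukey' column is carried into the result alongside df_a's, and the overlapping column names require you to supply a suffix for the left and right hand side. With suffixes supplied the call succeeds, but every right-hand value is NaN because df_a's mukey values are looked up in df_b's index (0 to 4), not in its 'mukey' column: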
In [173]:
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
mukey_left DI PI mukey_right niccdcd
index
0 100000 35 14 NaN NaN
1 1000005 44 14 NaN NaN
2 1000006 44 14 NaN NaN
3 1000007 43 13 NaN NaN
4 1000008 43 13 NaN NaN
merge works because it doesn't have this restriction:
In [176]:
df_a.merge(df_b, on='mukey', how='left')
Out[176]:
mukey DI PI niccdcd
0 100000 35 14 NaN
1 1000005 44 14 NaN
2 1000006 44 14 NaN
3 1000007 43 13 NaN
4 1000008 43 13 NaN
The .join() function matches against the index of the DataFrame passed as its argument, so you should either call set_index on that DataFrame or use the .merge function instead.
Here are two examples that should work in your case:
join_df = df_a.join(df_b.set_index('mukey'), on='mukey', how='left')
or
join_df = df_a.merge(df_b, on='mukey', how='left')
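The two calls above can be compared side by side. This is a minimal sketch on made-up data: unlike the posted snippet, the hypothetical frames below share two mukey values so that the effect of the join is visible.

```python
import pandas as pd

# Hypothetical data: unlike the posted snippet, some mukey values are shared.
df_a = pd.DataFrame({"mukey": [100000, 1000005, 1000006],
                     "DI": [35, 44, 44],
                     "PI": [14, 14, 14]})
df_b = pd.DataFrame({"mukey": [1000005, 1000006, 190240],
                     "niccdcd": [4, 6, 7]})

# join() looks up df_a['mukey'] in the *index* of the other frame,
# so move df_b's key column into its index first.
joined = df_a.join(df_b.set_index("mukey"), on="mukey", how="left")

# merge() matches the two 'mukey' columns directly.
merged = df_a.merge(df_b, on="mukey", how="left")

# Without set_index, 'mukey' remains a column on both sides and join() raises.
try:
    df_a.join(df_b, on="mukey", how="left")
    raised = False
except ValueError:
    raised = True

print(joined)
print(merged)
print("plain join raised ValueError:", raised)
```

Both joined and merged keep all three rows of df_a; the row whose mukey has no partner in df_b gets NaN in niccdcd.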
This error indicates that the two tables have one or more columns with the same name.
The error message translates to: "I can see the same column in both tables but you haven't told me to rename either one before bringing them into the same table"
You either want to delete one of the columns before bringing it in from the other one using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one that is being brought in.
df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
The error indicates that the two tables have one or more columns with the same name.
Anyone with the same error who doesn't want to provide a suffix can rename the columns instead. Also make sure the index of both DataFrames match in type and value if you don't want to provide the on='mukey' setting.
# rename example
df_a = df_a.rename(columns={'a_old': 'a_new', 'a2_old': 'a2_new'})
# set the index
df_a = df_a.set_index(['mukey'])
df_b = df_b.set_index(['mukey'])
df_a.join(df_b)
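The point about the indexes matching in type as well as value is easy to miss. The sketch below uses two hypothetical frames (left and right are illustrative names) whose keys look identical but are stored as integers on one side and strings on the other.

```python
import pandas as pd

# Hypothetical frames: the same keys, stored as int on one side and str on the other.
left = pd.DataFrame({"DI": [35, 44]},
                    index=pd.Index([100000, 1000005], name="mukey"))
right = pd.DataFrame({"niccdcd": [4, 6]},
                     index=pd.Index(["100000", "1000005"], name="mukey"))

# Index-on-index join: no suffix is needed because no column names overlap,
# but nothing matches while the index dtypes differ.
mismatched = left.join(right)

# Casting the right-hand index to int makes the keys line up.
right.index = right.index.astype(int)
matched = left.join(right)

print(mismatched)
print(matched)
```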
join is mainly meant for joining on the index, not on column names. Rename the overlapping columns so that the two dataframes no longer share a column name, then join them; otherwise this error is raised.
In my code I've generated a range of dates using pd.date_range in an effort to compare it to a column of dates read in from Excel using pandas. The generated range of dates is referred to as "all_dates".
all_dates=pd.date_range(start='1998-12-31', end='2020-06-23')
for i, date in enumerate(period):  # where 'Period' is the column of excel dates
    if date == all_dates[i]:  # loop until date from excel doesn't match date from generated dates
        continue
    else:
        missing_dates_stock.append(i)  # keep list of locations where dates are missing
        stock_data.insert(i, "NaN")  # insert 'NaN' where missing date is found
This results in TypeError: argument of type 'Timestamp' is not iterable. How can I make the data types match such that I can iterate and compare them? Apologies as I am not very fluent in Python.
I think you are trying to create a NaN row if the date does not exist in the excel file.
Here's a way to do it. You can use the df.merge option.
I am creating df1 to simulate the Excel file. It has two columns, sale_dt and sale_amt. If a sale_dt does not exist, we want to create a separate row for it with NaN in sale_amt. To simulate this, I am creating a date range from 1998-12-31 through 2020-06-23 with a frequency of 5 days, so the dataframe has 4 missing dates between each pair of consecutive rows. The solution should create 4 dummy rows for each gap, with the correct dates in ascending order.
import pandas as pd
import random
#create the sales dataframe with missing dates
df1 = pd.DataFrame({'sale_dt': pd.date_range(start='1998-12-31', end='2020-06-23', freq='5D'),
                    'sale_amt': random.sample(range(1, 2000), 1570)
                    })
print (df1)
#now create a dataframe with all the dates between '1998-12-31' and '2020-06-23'
df2 = pd.DataFrame({'date':pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')})
print (df2)
#now merge both dataframes with outer join so you get all the rows.
#i am also sorting the data in ascending order so you can see the dates
#also dropping the original sale_dt column and renaming the date column as sale_dt
#then resetting index
df1 = (df1.merge(df2, left_on='sale_dt', right_on='date', how='outer')
          .drop(columns=['sale_dt'])
          .rename(columns={'date': 'sale_dt'})
          .sort_values(by='sale_dt')
          .reset_index(drop=True))
print (df1.head(20))
The original dataframe was:
sale_dt sale_amt
0 1998-12-31 1988
1 1999-01-05 1746
2 1999-01-10 1395
3 1999-01-15 538
4 1999-01-20 1186
... ... ...
1565 2020-06-03 560
1566 2020-06-08 615
1567 2020-06-13 858
1568 2020-06-18 298
1569 2020-06-23 1427
The output of this will be (first 20 rows):
sale_amt sale_dt
0 1988.0 1998-12-31
1 NaN 1999-01-01
2 NaN 1999-01-02
3 NaN 1999-01-03
4 NaN 1999-01-04
5 1746.0 1999-01-05
6 NaN 1999-01-06
7 NaN 1999-01-07
8 NaN 1999-01-08
9 NaN 1999-01-09
10 1395.0 1999-01-10
11 NaN 1999-01-11
12 NaN 1999-01-12
13 NaN 1999-01-13
14 NaN 1999-01-14
15 538.0 1999-01-15
16 NaN 1999-01-16
17 NaN 1999-01-17
18 NaN 1999-01-18
19 NaN 1999-01-19
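A different route to the same result, not the merge used above, is to put the dates in the index and reindex against the full daily range. The sketch below assumes a small hypothetical frame named sales with four rows; filled and missing_dates are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data: one row every 5 days.
sales = pd.DataFrame({
    "sale_dt": pd.date_range(start="1998-12-31", periods=4, freq="5D"),
    "sale_amt": [1988, 1746, 1395, 538],
})

# Every calendar day between the first and last recorded date.
all_dates = pd.date_range(sales["sale_dt"].min(), sales["sale_dt"].max(), freq="D")

# Reindexing on the full range inserts NaN rows for the missing dates.
filled = (sales.set_index("sale_dt")
               .reindex(all_dates)
               .rename_axis("sale_dt")
               .reset_index())

# Dates that were absent from the original data.
missing_dates = all_dates.difference(pd.DatetimeIndex(sales["sale_dt"]))

print(filled)
print(missing_dates)
```

This also avoids looping over the two date sequences by position, which goes out of step as soon as the first gap is reached.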
I am facing a weird scenario.
I have a data frame holding the 3 largest scores for each id, like this:
id rid code score
1 9 67 43
1 8 87 22
1 4 32 20
2 3 56 43
3 10 22 100
3 5 67 50
Here the id values repeat, but the other values differ from row to row.
I want to make my data frame like this:
id first_code second_code third_code
1 67 87 32
2 56 none none
3 22 67 none
So I have built a dataframe holding the top 3 scores per id. If an id has fewer than 3 values, I take the top 2 or the single value available. Based on the score, I want to rearrange the code column into three columns: first_code holds the code with the highest score, second_code the code with the second highest, and third_code the code with the third highest. Where a value is missing, I will leave it blank.
Kindly help me to solve this.
Use GroupBy.cumcount for counter, create MultiIndex and reshape by Series.unstack:
df = df.set_index(['id',df.groupby('id').cumcount()])['code'].unstack()
df.columns=['first_code', 'second_code', 'third_code']
df = df.reset_index()
print (df)
id first_code second_code third_code
0 1.0 67.0 87.0 32.0
1 2.0 56.0 NaN NaN
2 3.0 22.0 67.0 NaN
By the way, cumcount could also be used in the earlier code that filters the top 3 values.
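A sketch of that idea, assuming the unfiltered data looks like the question's table plus one extra low-scoring row for id 1 (the extra row and the rank column are made up for illustration): sort by score, number the rows within each id with cumcount, keep ranks 0 to 2, then reshape.

```python
import pandas as pd

# Hypothetical unfiltered data: id 1 has four rows, so one must be dropped.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 3, 3],
    "rid":   [9, 8, 4, 7, 3, 10, 5],
    "code":  [67, 87, 32, 11, 56, 22, 67],
    "score": [43, 22, 20, 5, 43, 100, 50],
})

# Rank rows within each id by descending score (0 = highest).
df = df.sort_values(["id", "score"], ascending=[True, False])
df["rank"] = df.groupby("id").cumcount()

# Keep only the top three per id, then spread the codes into columns.
top3 = df[df["rank"] < 3]
wide = (top3.set_index(["id", "rank"])["code"]
            .unstack()
            .reindex(columns=[0, 1, 2]))  # guarantees three columns
wide.columns = ["first_code", "second_code", "third_code"]
wide = wide.reset_index()

print(wide)
```

The reindex step is there so that assigning three column names still works when no id has three rows.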
I have a CSV, but the rows have different number of columns, because in some rows, some values are missing. So there is no index. The "meaning" of each value is at the moment encoded by a prefix to the value. I need to clean my CSV so as to create a new one, that only holds values of certain columns, based on the prefix.
It looks like this:
001234;aA431;cFM33;jJE LE (3);xABCD;421;
004321;aB432;cPD99;433
006543;aC332;cHR31;x4231;499
The new CSV should have a header, its name can be the prefix (first letter) of the column:
0;a;c;4
01234;A431;FM33;21
04321;B432;PD99;33
06543;C332;HR31;99
I am starting to work with python pandas, so any hints in that direction would be esp. welcome.
You can use
df1=df.astype(str).copy()
cols = df1.iloc[0].str[0].tolist()
df1=df1.apply(lambda x: x.str[1:])
df1.columns = cols
input
A B C D E F
0 1234 aA431 cFM33 jJE LE (3) xABCD 421.0
1 4321 aB432 cPD99 433 NaN NaN
2 6543 aC332 cHR31 x4231 499 NaN
output
print(df1)
1 a c j x 4
0 234 A431 FM33 JE LE (3) ABCD 21.0
1 321 B432 PD99 33 an an
2 543 C332 HR31 4231 99 an
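Because the prefix, not the position, carries the meaning of each value, another option is to read the file line by line and key every field by its first character. This is a minimal sketch under a few assumptions: the sample rows stand in for the real file, each prefix appears at most once per row, and the names raw, wanted and clean are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the real file, using the rows from the question.
raw = io.StringIO(
    "001234;aA431;cFM33;jJE LE (3);xABCD;421;\n"
    "004321;aB432;cPD99;433\n"
    "006543;aC332;cHR31;x4231;499\n"
)

wanted = ["0", "a", "c", "4"]  # prefixes to keep, taken from the desired header

records = []
for line in raw:
    # Split on ';' and ignore empty fields such as the one after a trailing ';'.
    fields = [f for f in line.strip().split(";") if f]
    # Map each field's first character (its prefix) to the rest of the value.
    records.append({f[0]: f[1:] for f in fields})

# Prefixes missing from a row become NaN; `wanted` also fixes the column order.
clean = pd.DataFrame(records).reindex(columns=wanted)
csv_text = clean.to_csv(sep=";", index=False)

print(csv_text)
```

Keeping the values as strings preserves leading zeros such as 01234, which would be lost if the column were parsed as a number.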
I have a dataframe, with recordings of statistics in multiple columns.
I have a list of the column names: stat_columns = ['Height', 'Speed'].
I want to combine the data to get one row per id.
The data comes sorted with the newest records on the top. I want the most recent data, so I must use the first value of each column, by id.
My dataframe looks like this:
Index  id      Height  Speed
0      100007          8.3
1      100007  54
2      100007          8.6
3      100007  52
4      100035  39
5      100014  44
6      100035          5.6
And I want it to look like this:
Index  id      Height  Speed
0      100007  54      8.3
1      100014  44
2      100035  39      5.6
I have tried a simple groupby myself:
df_stats = df_path.groupby(['id'], as_index=False).first()
But this seems to only give me a row with the first statistic found.
Your solution works for me. It may be necessary to replace the empty values with NaN first (this needs import numpy as np):
df_stats = df_path.replace('', np.nan).groupby('id', as_index=False).first()
print (df_stats)
id Index Height Speed
0 100007 0 54.0 8.3
1 100014 5 44.0 NaN
2 100035 4 39.0 5.6
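The difference the replacement makes can be seen on a small reconstruction of the question's data. The frame below is an assumption: it uses empty strings for the blank cells, which is only one way the blanks might be stored.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's data, with '' for empty cells.
df_path = pd.DataFrame({
    "id":     [100007, 100007, 100007, 100007, 100035, 100014, 100035],
    "Height": ["", 54, "", 52, 39, 44, ""],
    "Speed":  [8.3, "", 8.6, "", "", "", 5.6],
})

# Empty strings count as real values, so first() returns them as-is.
naive = df_path.groupby("id", as_index=False).first()

# After converting '' to NaN, first() skips the gaps in each column independently.
df_stats = df_path.replace("", np.nan).groupby("id", as_index=False).first()

print(naive)
print(df_stats)
```

In the naive result, id 100007 gets an empty Height because its first row has no height; after the replacement it gets 54, the first non-missing value.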
I currently have data in the following format in a dataframe:
metric__name sample sample_date
0 ga:visitBounceRate 100 2012-11-13
1 ga:uniquePageviews 20 2012-11-13
2 ga:newVisits 19 2012-11-13
3 ga:visits 20 2012-11-13
4 ga:percentNewVisits 95 2012-11-13
5 ga:pageviewsPerVisit 1 2012-11-13
6 ga:pageviews 20 2012-11-13
7 ga:visitBounceRate 72 2012-11-14
8 ga:uniquePageviews 63 2012-11-14
9 ga:newVisits 39 2012-11-14
That being said, I am trying to break out the metric__name column into something like this.
ga:visitBounceRate ga:uniquePageviews ga:newVisits etc...
sample_date
2012-11-13 100 20 19 etc...
I am doing the following to get my desired result.
df.pivot(index='sample_date', columns='metric__name', values='sample')
All I keep getting is an error saying the index contains multiple values, which it indeed does. But why doesn't it understand that the dates are the same and map them to the same row, as in my desired output?
Use pivot_table (which doesn't throw this exception):
In [11]: df.pivot_table('sample', 'sample_date', 'metric__name')
Out[11]:
metric__name ga:newVisits ga:pageviews ga:pageviewsPerVisit ga:percentNewVisits ga:uniquePageviews ga:visitBounceRate ga:visits
sample_date
2012-11-13 19 20 1 95 20 100 20
2012-11-14 39 NaN NaN NaN 63 72 NaN
It accepts an aggregation function (the default is the mean):
aggfunc : function, default numpy.mean, or list of functions
If list of functions passed, the resulting pivot table will have hierarchical columns
whose top level are the function names (inferred from the function objects themselves)
Regarding the difference between the two, I think pivot just does reshaping (and throws an error if there is a problem), whereas pivot_table offers more advanced functionality, aka "spreadsheet-style pivot tables".
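The difference shows up as soon as one date and metric pair occurs twice. The sketch below uses a small hypothetical frame with a duplicated ga:visits entry for 2012-11-13; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one duplicated (sample_date, metric__name) pair.
df = pd.DataFrame({
    "metric__name": ["ga:visits", "ga:visits", "ga:newVisits", "ga:visits"],
    "sample":       [20, 30, 19, 63],
    "sample_date":  ["2012-11-13", "2012-11-13", "2012-11-13", "2012-11-14"],
})

# pivot() only reshapes, so a duplicated index/column pair is an error.
try:
    df.pivot(index="sample_date", columns="metric__name", values="sample")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (mean by default).
table = df.pivot_table(values="sample", index="sample_date", columns="metric__name")

print("pivot raised ValueError:", pivot_failed)
print(table)
```

If the duplicates are unexpected, it is worth inspecting them before aggregating, since taking the mean silently hides them.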