I was trying to add a new Column to my dataset but when i did the column only had 1 index
is there a way to make one value be in al indexes in a column
import pandas as pd
df = pd.read_json('file_1.json', lines=True)
df2 = pd.read_json('file_2.json', lines=True)
df3 = pd.concat([df,df2])
df3 = df.loc[:, ['renderedContent']]
görüş_column = ['Milet İttifakı']
df3['Siyasi Yönelim'] = görüş_column
As per my understanding, this could be your possible solution:-
You have mentioned these lines of code:-
df3 = pd.concat([df,df2])
df3 = df.loc[:, ['renderedContent']]
You can modify them into
df3 = pd.concat([df,df2],axis=1) ## axis=1 means second dataframe will add to columns, default value is axis=0 which adds to the rows
Second point is,
df3 = df3.loc[:, ['renderedContent']]
I think you want to write this one , instead of df3=df.loc[:,['renderedContent']].
Hope it will solve your problem.
Related
Working with a CSV file in PyCharm. I want to delete the automatically-generated index column. When I print it, however, the answer I get in the terminal is "None". All the answers by other users indicate that the reset_index method should work.
If I just say "df = df.reset_index(drop=True)" it does not delete the column, either.
import pandas as pd
df = pd.read_csv("music.csv")
df['id'] = df.index + 1
cols = list(df.columns.values)
df = df[[cols[-1]]+cols[:3]]
df = df.reset_index(drop=True, inplace=True)
print(df)
I agree with #It_is_Chris. Also,
This is not true because return is None:
df = df.reset_index(drop=True, inplace=True)
It's should be like this:
df.reset_index(drop=True, inplace=True)
or
df = df.reset_index(drop=True)
Since you said you're trying to "delete the automatically-generated index column" I could think of two solutions!
Fist solution:
Assign the index column to your dataset index column. Let's say your dataset has already been indexed/numbered, then you could do something like this:
#assuming your first column in the dataset is your index column which has the index number of zero
df = pd.read_csv("yourfile.csv", index_col=0)
#you won't see the automatically-generated index column anymore
df.head()
Second solution:
You could delete it in the final csv:
#To export your df to a csv without the automatically-generated index column
df.to_csv("yourfile.csv", index=False)
I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections df_corr = pd.read_csv("corrections .csv")
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corr.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
df_corr.set_index(idx).rename(
columns={"value": "value1"}), fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that its a simple matter of applying the correction by adding the value1 and value columns (and finally dropping the column containing the corrections).
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction by adding the columns.
df_corrected.value1 = df_corrected.value1 + df_corrected.value
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
Set date and location as index in both dataframes, add the two and fillna
df.set_index(['date','location'], inplace=True)
df1.set_index(['date','location'], inplace=True)
df['value1']=(df['value1']+df1['value']).fillna(df['value1'])
I have a Dataframe df1 with the columns. I need to compare the headers of columns in df1 with a list of headers from df2
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare the df1 with df2 and if any missing columns, I need to add them to df1 with blank values.
I tried concat and also append and both didn't work. with concat, I'm not able to add the column e and with append, it is appending all the columns from df1 and df2. How would I get only missing column added to df1 in the same order?
df1_cols = df1.columns
df2_cols = df2._combine_match_columns
if (df1_cols == df2_cols).all():
df1.to_csv(path + file_name, sep='|')
else:
print("something is missing, continuing")
#pd.concat([my_df,flat_data_frame], ignore_index=False, sort=False)
all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values
pandas.DataFrame.align
df1.align(df2, axis=1)[0]
By default this does an 'outer' join
By specifying axis=1 we focus on columns
This returns a tuple of both an aligned df1 and df2 with the calling dataframe being the first element. So I grab the first element with [0]
pandas.DataFrame.reindex
df1.reindex(columns=df1.columns | df2.columns)
You can treat pandas.Index objects like sets most of the time. So df1.columns | df2.columns is the union of those two index objects. I then reindex using the result.
Lets first create the two dataframes as:
import pandas as pd, numpy as np
df1 = pd.DataFrame(np.random.random((5,5)), columns = ['a','b','c','d','f'])
df2 = pd.DataFrame(np.random.random((5,7)), columns = ['a','b','c','d','e','f','g'])
Now add those columns of df2 to df1 (with nan values), which are not in df1:
for i in list(df2):
if i not in list(df1):
df1[i] = np.nan
Now display the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]
enter image description hereWhen applying the below code , i am getting NAN values in the entire column of QSTS_ID
df['QSTS_ID'] = df['QSTS_ID'].str.split('.',expand=True)
df
I want to copy the entire QSTS_ID column and append it at the end. I also have to delimit it by fullstop and apply new headers
Problem is if add parameter expand=True it return DataFrame with one or more columns, so assign return NaNs.
Solution is add new columns with join or concat to original DataFrame, also add_prefix is for change new columns names:
df = df.join(df['QSTS_ID'].str.split('.',expand=True).add_prefix('QSTS_ID_'))
df = pd.concat([df, df['QSTS_ID'].str.split('.',expand=True).add_prefix('QSTS_ID_')], axis=1)
If want also remove original column:
df = df.join(df.pop('QSTS_ID').str.split('.',expand=True).add_prefix('QSTS_ID_'))
df = pd.concat([df,
df.pop('QSTS_ID').str.split('.',expand=True).add_prefix('QSTS_ID_')], axis=1)
Sample:
df = pd.DataFrame({
'QSTS_ID':['val_k.lo','val2.s','val3.t'],
'F':list('abc')
})
df1 = df['QSTS_ID'].str.split('.',expand=True).add_prefix('QSTS_ID_')
df = df.join(df1)
print (df)
QSTS_ID F QSTS_ID_0 QSTS_ID_1
0 val_k.lo a val_k lo
1 val2.s b val2 s
2 val3.t c val3 t
#check columns names of new columns
print (df1.columns)
I just want to create a dataFrame that is updated with itself(df3), adding rows from other dataFrames (df1,df2) based on an index ("ID").
When adding a new dataFrame if an overlap index is found, update the data. If it is not found, add the data including the new index.
df1 = pd.DataFrame({"Proj. Num" :["A"],'ID':[000],'DATA':["NO_DATA"]})
df1 = df1.set_index(["ID"])
df2 = pd.DataFrame({"Proj. Num" :["B"],'ID':[100],'DATA':["OK"], })
df2 = df2.set_index(["ID"])
df3 = pd.DataFrame({"Proj. Num" :["B"],'ID':[100],'DATA':["NO_OK"], })
df3 = df3.set_index(["ID"])
#df3 = pd.concat([df1,df2, df3]) #Concat,merge,join???
df3
I have tried concatenate with _verify_integrity=False_ but it just gives an error, and I think there is a more simple/nicer way to do it.
Solution with concat + Index.duplicated for boolean mask and filter by boolean indexing:
df3 = pd.concat([df1, df2, df3])
df3 = df3[~df3.index.duplicated()]
print (df3)
DATA Proj. Num
ID
0 NO_DATA A
100 OK B
Another solution by comment, thank you:
df3 = pd.concat([df3,df1])
df3 = df3[~df3.index.duplicated(keep='last')]
print (df3)
DATA Proj. Num
ID
100 NO_OK B
0 NO_DATA A
You can concatenate all the dataframes along the index; group by index and decide which element to keep of the group sharing the same index.
From your question it looks like you want to keep the last (most updated) element with the same index. It is then important the order in which you pass the dataframes in the pd.concat function.
For a list of other methods, see here.
res = pd.concat([df1, df2, df3], axis = 0)
res.groupby(res.index).last()
Which gives:
DATA Proj. Num
ID
0 NO_DATA A
100 NO_OK B
#update existing rows
df3.update(df1)
#append new rows
df3 = pd.concat([df3,df1[~df1.index.isin(df3.index)]])
#update existing rows
df3.update(df2)
#append new rows
df3 = pd.concat([df3,df2[~df2.index.isin(df3.index)]])
Out[2438]:
DATA Proj. Num
ID
100 OK B
0 NO_DATA A