having trouble removing the first column in data frame - python

I am trying to remove the unnamed column in my data frame.
Here is a snip of the dataframe (screenshot not reproduced here).
How do I get rid of the unnamed column highlighted in pink?
Here is my code:
import csv
import numpy as np
import pandas as pd

df = pd.read_csv("/nalt_labels_DATA/nalt_altlabels.csv")
df['randomint'] = np.random.randint(100, 500, size=len(df))
df['Nalt_URI_suffix'] = df.loc[:,'NALT_URI'].astype(str) + '_' + df.loc[:,'randomint'].astype(str)
df = df.drop(['randomint', 'NALT_URI'], axis=1)  # drop returns a copy, so reassign
df1 = df[['Nalt_URI_suffix', 'Label']]
df1.to_csv("/nalt_labels_DATA/nalt_altlabels_suffix.csv")
# rewrite the CSV as a TSV
csv.writer(open("/nalt_labels_DATA/nalt_altlabels_suffix.tsv", 'w+'), delimiter='\t').writerows(
    csv.reader(open("/nalt_labels_DATA/nalt_altlabels_suffix.csv")))
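If the unnamed column is the row index that to_csv writes out by default (it comes back as "Unnamed: 0" on the next read), two likely fixes, sketched under that assumption:
# Option 1: don't write the row index when saving
df1.to_csv("/nalt_labels_DATA/nalt_altlabels_suffix.csv", index=False)
# Option 2: treat the first column as the index when reading the file back
df1 = pd.read_csv("/nalt_labels_DATA/nalt_altlabels_suffix.csv", index_col=0)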

Related

Rearranging with pandas melt

I am trying to rearrange a DataFrame. Currently, I have 1035 rows and 24 columns, one for each hour of the day. I want to turn this into an array with 1035*24 rows. If you want to see the data, it can be extracted from the following JSON endpoint:
import json
from urllib.request import urlopen

url = "https://www.svk.se/services/controlroom/v2/situation?date={}&biddingArea=SE1"
svk = []
for i in parsing_range_svk:  # parsing_range_svk holds the dates to query
    data_json_svk = json.loads(urlopen(url.format(i)).read())
    svk.append([v["y"] for v in data_json_svk["Data"][0]["data"]])
This is the code I am using to rearrange this data, but it is not doing the job. The first observation is in the right place, but then it starts getting messy; I have not been able to figure out where each observation goes.
from datetime import datetime, timedelta

import pandas as pd

svk = pd.DataFrame(svk)
date_start1 = datetime(2020, 1, 1)
date_range1 = [date_start1 + timedelta(days=x) for x in range(1035)]
date_svk = pd.DataFrame(date_range1, columns=['date'])
svk['date'] = date_svk['date']
svk.drop(24, axis=1, inplace=True)
consumption_svk_1 = (svk.melt('date', value_name='SE1_C')
                        .assign(date=lambda x: x['date'] +
                                pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
                        .sort_values('date', ignore_index=True))
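One alternative to check against, a minimal sketch using stack(), which keeps each day's 24 hourly values together by construction (assumes svk has integer hour columns 0-23 plus the 'date' column, as built above):
# stack() yields one row per (date, hour-column) pair, in row-major order
long_svk = svk.set_index('date').stack().reset_index(name='SE1_C')
# shift each date by its hour offset; 'level_1' holds the original column label
long_svk['date'] = long_svk['date'] + pd.to_timedelta(
    long_svk.pop('level_1').astype(int), unit='h')
long_svk = long_svk.sort_values('date', ignore_index=True)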

divide the row into two rows after several columns

I have a CSV file, and I am trying to split each row into multiple rows if it contains more than 4 columns.
Example: (screenshot not reproduced here)
Expected output: (screenshot not reproduced here)
Is there a way to do that in pandas or Python?
Sorry if this is a simple question.
When there are two columns with the same name in a CSV file, pandas automatically appends an integer suffix to the duplicate column name (e.g. a second x becomes x.1).
For example, this CSV file (shown as a screenshot in the original) will be read in with deduplicated column names:
df = pd.read_csv("Book1.csv")
df
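Since the screenshots are not reproduced here, a minimal self-contained illustration of that renaming (using an in-memory CSV in place of Book1.csv):
import io

import pandas as pd

# two 'x' and two 'y' columns; pandas deduplicates the names on read
csv_text = "id,x,y,x,y\n1,10,20,30,40\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # ['id', 'x', 'y', 'x.1', 'y.1']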
Now, to solve your question, let's consider the above dataframe as the input dataframe.
Try this:
cols = df.columns.tolist()
cols.remove('id')
start = 0
end = 4
new_df = []
final_cols = ['id', 'x1', 'y1', 'x2', 'y2']
while start < len(cols):
    if end > len(cols):
        end = len(cols)
    temp = cols[start:end]
    start = end
    end = end + 4
    temp_df = df.loc[:, ['id'] + temp]
    temp_df.columns = final_cols[:1 + len(temp)]
    if len(temp) < 4:
        temp_df[final_cols[1 + len(temp):]] = None
    print(temp_df)
    new_df.append(temp_df)
pd.concat(new_df).reset_index(drop=True)
Result: (output shown in the original as a screenshot)
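For completeness, a numpy-based sketch of the same split (an assumption here: the number of value columns after 'id' is an exact multiple of 4; note the row order differs from the concat approach, since each input row's chunks stay adjacent):
import numpy as np
import pandas as pd

vals = df.drop(columns='id').to_numpy()
chunks = vals.shape[1] // 4  # number of 4-column groups per row
# reshape is row-major, so each input row contributes `chunks` consecutive rows
out = pd.DataFrame(vals.reshape(-1, 4), columns=['x1', 'y1', 'x2', 'y2'])
out.insert(0, 'id', np.repeat(df['id'].to_numpy(), chunks))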
You can first set the video column as the index, then concatenate every remaining group of 4 columns into a new dataframe. At the end, reset the index to get the video column back.
df.set_index('video', inplace=True)
dfs = []
for i in range(len(df.columns)//4):
    d = df.iloc[:, range(i*4, i*4+4)]
    dfs.append(d.set_axis(['x_center', 'y_center']*2, axis=1))
df_ = pd.concat(dfs).reset_index()
I think the following list comprehension should work too; as originally written it raised a positional indexing error, likely because the comma after : in df.iloc was missing (it must be df.iloc[:, range(...)]):
df_ = pd.concat([df.iloc[:, range(i*4, i*4+4)].set_axis(['x_center', 'y_center']*2, axis=1) for i in range(len(df.columns)//4)])
print(df_)
  video   x_center   y_center   x_center   y_center
0   1_1  31.510973  22.610222  31.383655  22.488293
1   1_1  31.856295  22.830109  32.016905  22.948702
2   1_1  32.011684  22.990689  31.933356  23.004779

Loops in Python: how to apply the same set of code in a loop

Thank you for all your help with my previous questions.
Now it leads me to my final and most difficult task. Let me break it down:
I have a file named:
"PCU1-160321.csv"
Then I run the following code (see below), which does all sorts of things I need to do.
This leaves me with the most difficult task, which is to apply the same set of code (see below), again and again, to the other 23 files, named:
PCU2-160321.csv
PCU3-160321.csv
...
PCU24-160321.csv
E.g. I wish to call df2 for PCU2-160321.csv, df3 for PCU3-160321.csv, etc. (hence, a loop...)
Or is there a better looping method?
Below I attach the visualisation (not reproduced here):
Thank you very much.
UPDATE:
I did try something like this, but it didn't work...
#Assign file names (file1 already ends in '.csv', so build the output name from the stem)
file1 = 'PCU1-160321.csv'
file_out1 = file1[:-4] + '_15min.csv'
#Read csv file and assign header (file1 already contains the extension)
df1 = pd.read_csv(gdrive_url+file1, sep=';', names=['Date','Time_Decimal','Parameter','Value'])
#Split the Time_Decimal column, concatenate the decimal seconds back into the Time column, and rearrange the column order
df1[['Time','DecimalSecond']] = df1.Time_Decimal.str.split(".",expand=True)
df1 = df1[['Date', 'Time', 'DecimalSecond', 'Parameter','Value']]
#Split off AM/PM and concatenate DecimalSecond (df1, not df)
df1[['Time', 'AMPM']] = df1.Time.str.split(" ", expand=True)
df1 = df1[['Date', 'Time', 'AMPM', 'DecimalSecond', 'Parameter','Value']]
df1['Time'] = df1['Time'].map(str) + '.' + df1['DecimalSecond'].map(str) + ' ' + df1['AMPM'].map(str)
df1['Timestamp'] = df1['Date'].map(str) + ' ' + df1['Time']
df1 = df1[['Timestamp', 'Parameter','Value']]
#Parse the timestamp and set it as the index
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1 = df1.set_index('Timestamp')
#Assign the parameters I want to filter out of the df above
Parameter1 = 'DCA_L_1'
Parameter2 = 'DCA_L_2'
#Filter based on the variables defined above (contains whatever parameter I need)
df1_param1 = df1[df1['Parameter'].str.contains(Parameter1)]
df1_param2 = df1[df1['Parameter'].str.contains(Parameter2)]
#Rename the column headers
df1_param1.columns = ['Par1','Val1']
df1_param2.columns = ['Par2','Val2']
#Obtain the exact parameter names as strings; used for the new df's top row
par1 = df1_param1.head(1)['Par1'].values[0]
par2 = df1_param2.head(1)['Par2'].values[0]
#Downsample to 15 minutes
df1_param1 = df1_param1.resample('15min').mean()
df1_param2 = df1_param2.resample('15min').mean()
#Concatenate all the dfs - except the empty df, df_ppc_param4
df1_concat = pd.concat([df1_param1, df1_param2], axis=1)
#Select Values (the closing bracket was missing here)
df1_concat = df1_concat[['Val1','Val2']]
#Rename the columns of the new df1_concat
df1_concat.columns = [par1,par2]
#Save output as csv (df1_concat and file_out1, not df_concat and file_out)
df1_concat.to_csv(gdrive_url_out+file_out1, index=True)
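One way to avoid repeating this 24 times, as a minimal sketch (the function name process_file and the dict dfs are illustrative, not from the original): wrap the steps above in a function parameterised on the file name, then loop over the PCU numbers.
def process_file(file_name):
    #Read the file; all the per-file steps above go here, with file_name in place of file1
    df = pd.read_csv(gdrive_url + file_name, sep=';', names=['Date','Time_Decimal','Parameter','Value'])
    #... same transformations as above, ending with the concatenated dataframe ...
    return df

dfs = {}  #e.g. dfs[2] will hold the dataframe for PCU2-160321.csv
for n in range(1, 25):
    file_name = 'PCU{}-160321.csv'.format(n)
    dfs[n] = process_file(file_name)
    dfs[n].to_csv(gdrive_url_out + file_name[:-4] + '_15min.csv', index=True)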

Update dataframe rows through loop

I have a dataframe and I want to create some new columns that contain the growth of the original columns.
First, I append the new columns to the dataframe, filling them with NaN values.
Then, for every row I check if the previous row corresponds to the previous year, and if it does I want to fill the new column with the growth of the variable. Otherwise I just leave the NaN value.
Here is my code:
for index, row in df.iterrows():
    if df.loc[index,'year'] == df.loc[index - 1, 'year'] + 1 and df.loc[index,'name'] == df.loc[index - 1, 'name']:
        df.loc[index,k:] = (df.loc[index,1:k-1]/df.loc[index-1,1:k-1]) - 1
Where k is the column index of the first new "growth" column that I created.
The problem with this code is that it leaves the new columns with NaN values, without making any change. Did I do anything wrong?
Thanks
df.sort_values('year', inplace=True)
growth_cols = [<your-growth-cols>]
new_cols = [x + "_growth" for x in growth_cols]
growth_df = df[growth_cols] / df[growth_cols].shift(1)
growth_df.rename(columns=dict(zip(growth_cols, new_cols)), inplace=True)
df = pd.concat([df, growth_df], axis=1)
df['gap'] = df.year.diff()
for col in new_cols:
    df[col] = df[col] * df['gap']
    df[col].replace(0, np.nan, inplace=True)
df.drop('gap', axis=1, inplace=True)
EDIT (based on the updated question):
You would need to change the line
df['gap'] = df.year.diff()
to:
df['gap'] = df.groupby('name')['year'].diff()
(selecting the year column, so the result is a single Series rather than a DataFrame)
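A compact alternative sketch using groupby with pct_change, masking rows whose predecessor is not the previous year (assumes columns 'name' and 'year' plus a list growth_cols of the numeric columns, as above):
import numpy as np

df = df.sort_values(['name', 'year'])
g = df.groupby('name')
growth = g[growth_cols].pct_change()    # (current / previous) - 1 within each name
growth[g['year'].diff() != 1] = np.nan  # keep NaN where the previous year is missing
df[[c + '_growth' for c in growth_cols]] = growth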

QTableWidget to DataFrame [duplicate]

I have a QTableWidget in editable mode in which the user puts in integer input. How can I generate a list of the data entered in this table so as to perform operations on it? Here is my manual code for that:
def dataframe_generation_from_table(self, table):
    number_of_rows = table.rowCount()
    number_of_columns = table.columnCount()
    tmp_df = pd.DataFrame({'Date': [], str(self.final_lvl_of_analysis): [], 'Value': []})
    for i in range(0, number_of_rows):
        for j in range(0, number_of_columns):
            tmp_item = table.item(i, j)
            tmp_df2 = pd.DataFrame({'Date': [pd.to_datetime(table.horizontalHeaderItem(j).data())],
                                    str(self.final_lvl_of_analysis): [str(table.verticalHeaderItem(i).data())],
                                    'Value': [float(tmp_item.data(0))]})
            print tmp_df2
            tmp_df.update(tmp_df2, join='left', overwrite=False)
    return tmp_df
Also, I am using the following code for the QTableWidget generation:
self.pd_table = QtGui.QTableWidget(self.groupBox_19)
self.pd_table.setObjectName(_fromUtf8("pd_table"))
self.pd_table.setColumnCount(0)
self.pd_table.setRowCount(0)
My specs are: pandas 0.18.1, PyQt 4 and Python 2.7.
I think you're overcomplicating it a little with the updates/joins. The simplest approach is to create the full-size DataFrame first (filled with NaN) and then assign the data to this:
def dataframe_generation_from_table(self, table):
    number_of_rows = table.rowCount()
    number_of_columns = table.columnCount()
    tmp_df = pd.DataFrame(
        columns=['Date', str(self.final_lvl_of_analysis), 'Value'],  # fill columns
        index=range(number_of_rows)  # fill rows
    )
    for i in range(number_of_rows):
        for j in range(number_of_columns):
            tmp_df.ix[i, j] = table.item(i, j).data()
    return tmp_df
The above code assigns data to its location by the numerical index, so position 1,1 in the QTableWidget will end up at 1,1 in the DataFrame. This way you don't need to worry about the column headers when moving data. If you want to change the column names you can do that when creating the DataFrame, by changing the values passed into the columns= parameter.
If you want to change a column to DateTime format, you should be able to do this in a single operation after the loop with:
tmp_df['Date'] = pd.to_datetime( tmp_df['Date'] )
The change from .data() to .text() eliminated the ValueError.
def saveFile(self):
    df = pd.DataFrame()
    savePath = QtGui.QFileDialog.getSaveFileName(None, "Blood Hound",
                                                 "Testing.csv", "CSV files (*.csv)")
    rows = self.tableWidget.rowCount()
    columns = self.tableWidget.columnCount()
    for i in range(rows):
        for j in range(columns):
            df.loc[i, j] = str(self.tableWidget.item(i, j).text())
    df.to_csv(savePath, header=None, index=0)
# creates a new df from the qtable's dimensions,
# copies the qtable (data & headers) to the df and returns the df
@staticmethod
def write_qtable_to_df(table):
    col_count = table.columnCount()
    row_count = table.rowCount()
    headers = [str(table.horizontalHeaderItem(i).text()) for i in range(col_count)]
    # df indexing is slow, so use lists
    df_list = []
    for row in range(row_count):
        df_list2 = []
        for col in range(col_count):
            table_item = table.item(row, col)
            df_list2.append('' if table_item is None else str(table_item.text()))
        df_list.append(df_list2)
    df = pandas.DataFrame(df_list, columns=headers)
    return df
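A hypothetical usage sketch (the class name TableWindow and the instance window are assumptions, not from the original):
# window is an instance of the hypothetical TableWindow class that owns
# a populated QTableWidget in window.tableWidget
df = TableWindow.write_qtable_to_df(window.tableWidget)
print df.head()  # Python 2, per the question's specs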
