adding data to an existing empty dataframe containing only column names - python

How can I add data to an existing empty column in a dataframe?
I have an empty dataframe with column names (stock tickers)
I am trying to add data to each stock, basically, populate the dataframe column by column, from left to right based on the header name.
I am pulling the data from another CSV file which looks like this (CSV file name = column name in the dataframe Im trying to populate):
PS aditional issue may arise due to the length of data available for each stock, eg. I may have a list of 10 values for the first stock, 0 for the second, and 25 for third. I plan to save this in a CSV, so perhaps it could not cause too big of an issue.
I have tried the following idea but without luck. any suggestions are welcome.
import pandas as pd
import os
path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'
df_tickers = pd.read_csv(path + Russell3k_Tickers)
divFls = os.listdir(path + Russell3k_Divs)
for i in divFls:
df = pd.read_csv(path + Russell3k_Divs + i)
Div = df['Dividends']
i = i[0].split('.')
df_tickers[i] = df_tickers.append(Div)
print(df_tickers)
break

import pandas as pd
import os
from tqdm import tqdm
path = 'F:/pathToFiles'
Russell3k_Divs = 'Russel3000-Divs/'
Russell3k_Tickers = 'Russell-3000-Stock-Tickers-List.csv'
df_tickers = pd.DataFrame()
divFls = os.listdir(path + Russell3k_Divs)
for i in tqdm(divFls):
df = pd.read_csv(path + Russell3k_Divs + i)
i = i.split('.')[0]
df[str(i)] = df['Date']
df_tickers = df_tickers.join(df[str(i)], how='outer')
df_tickers.to_csv('Russell-3000-Stock-Tickers-List1.csv', encoding='utf-8', index=False)
This answer was posted as an edit to the question adding data to an existing empty dataframe containing only column names by the OP Mr.Riply under CC BY-SA 4.0.

Related

Automatic transposing Excel user data in a Pandas Dataframe

I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop a Pandas code for, at least, parsing the first column and transposing the id and the full of each user. Could you help with this?
The way that I would tackle it, and I am assuming there are likely to be more efficient ways, is to import the excel file into a dataframe, and then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line into a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd
# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
if df['col1'][i] == counter:
name_pos = df['col2'][i].split(' (')
name = name_pos[0]
pos = name_pos[1].rstrip(name_pos[1][-1])
p_index = counter
counter += 1
else:
date = df['col1'][i].strftime('%d/%m/%Y')
amount = df['col2'][i]
line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date':date, 'amount': amount}
list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)
OUTPUT:

How can I export for loop result to CAV or Excel with pandas?

I have a for loop gets datas from a website and would like to export it to xlsx or csv file.
Normally when I print result of loop I can get all list but when I export that to xlsx file only get last item. Where is the problem can you help?
for item1 in spec:
spec2 = item1.find_all('th')
expl2 = item1.find_all('td')
spec2x = spec2[a].text
expl2x = expl2[a].text
yazim = spec2x + ': ' + expl2x
cumle = yazim
patern = r"(Brand|Series|Model|Operating System|CPU|Screen|MemoryStorage|Graphics Card|Video Memory|Dimensions|Screen Size|Touchscreen|Display Type|Resolution|GPU|Video Memory|Graphic Type|SSD|Bluetooth|USB)"
if re.search(patern, cumle):
speclist = translator.translate(cumle, lang_tgt='tr')
specl = speclist
#print(specl)
import pandas as pd
exp = [{ 'Prospec': specl,},]
df = pd.DataFrame(exp, columns = ['Prospec',])
df.to_excel('output1.xlsx',)
Create an empty list and, at each iteration in your for loop, append a data frame to the list. You will end up with a list of data frames. After the loop, use pd.concat() to create a new data frame by concatenating every element of your list. You can then save the resulting df to an excel file.
Your code would look something like this:
import pandas as pd
df_list = []
for item1 in spec:
......
if re.search(patern, cumle):
....
df_list.append(pd.DataFrame(.....))
df = pd.concat(df_list)
df.to_excel(.....)

Column appended to dataframe coming up empty

I have the following code:
import glob
import pandas as pd
import os
import csv
myList = []
path = "/home/reallymemorable/Documents/git/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"
for fname in glob.glob(path):
df = pd.read_csv(fname)
row = df.loc[df['Province_State'] == 'Pennsylvania']
dateFromFilename = os.path.basename(fname).replace('.csv','')
fileDate = pd.DataFrame({'Date': [dateFromFilename]})
myList.append(row.join(fileDate))
concatList = pd.concat(myList, sort=True)
print(concatList)
concatList.to_csv('/home/reallymemorable/Documents/test.csv', index=False, header=True
It goes through a folder of CSVs and grabs a specific row and puts it all in a CSV. The files themselves have names like 10-10-2020.csv. I have some code in there that gets the filename and removes the file extension, so I am left with the date alone.
I am trying to add another column called "Date" that contains the filename for each file.
The script almost works: it gives me a CSV of all the rows I pulled out of the various CSVs, but the Date column itself is empty.
If I do print(dateFromFilename), the date/filename prints as expected (e.g. 10-10-2020).
What am I doing wrong?
I believe join has how=left by default. And your fileDate dataframe has different index than row, so you wouldn't get the date. Instead, do an assignment:
for fname in glob.glob(path):
df = pd.read_csv(fname)
row = df.loc[df['Province_State'] == 'Pennsylvania']
dateFromFilename = os.path.basename(fname).replace('.csv','')
myList.append(row.assign(Date=dateFromFilename))
concatList = pd.concat(myList, sort=True)
Another way is to store the dataframes as a dictionary, then concat:
myList = dict()
for fname in glob.glob(path):
df = pd.read_csv(fname)
row = df.loc[df['Province_State'] == 'Pennsylvania']
dateFromFilename = os.path.basename(fname).replace('.csv','')
myList[dateFromFilename] = row
concatList = pd.concat(myList, sort=True)

Concatenating dataframes adding additional columns

I'm trying to create a combined dataframe from a series of 12 individual CSVs (12 months to combine for the year). All the CSVs have the same format and column layout.
When I first ran it, it appeared to work and I was left with a combined dataframe with 6 columns (as expected). Upon looking at it, I found that the header row was applied as actual data in all the files, so I had some bad rows I needed to eliminate. I could manually make these changes but I'm looking to have the code take care of this automatically.
So to that end, I updated the code so it only read in the first CSV with headers and the remaining CSVs without headers and concatenate everything together. This appears to work BUT I end up with 12 columns instead of 6 with the first 6 columns having NaNs for the first CSV and the last 6 columns having NaNs for the other 11 CSVs, which is obviously NOT what I want (see image below).
The code is similar, I just use the header=None parameter in pd.read_csv() for the 11 CSVs after the first (and I don't use that parameter for the first CSV). Can anyone give me a hint as to why I'm getting 12 columns (with the data placement as described) when I run this code? The layout of the CSV file is shown below.
Appreciate any help.
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
dfSD = pd.concat([dfSD, df])
totrows += df.shape[0]
print(file + " == " + str(df.shape[0]) + " rows")
print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))
One simple solution is the following.
import pandas as pd
import numpy as np
import os
totrows = 0
files = os.listdir('c:/data/datasets')
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
columns = []
dfSD = []
for file in files:
df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
if not columns:
columns = df.columns
df.columns = columns
dfSD.append(df)
totrows += df.shape[0]
print(file + " == " + str(df.shape[0]) + " rows")
dfSD = pd.concat(dfSD, axis = 0)
dfSD = dfSD.reset_index(drop = True)
Another possibility is:
import pandas as pd
import numpy as np
import os
# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0
# Get list of csv files to read.
files = os.listdir('c:/data/datasets')
# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
df.columns = dfSD.columns
df_comb.append(df)
totrows += df.shape[0]
print(file + " == " + str(df.shape[0]) + " rows")
dfSD = pd.concat([df_comb], axis = 0).reset_index(drop = True)

Column name in Csv changes after merging

I have multiple csv files that i need to merge. The column names are:
idSite idVisit visitIp visitorId
However in the merged file the column 'idSite' changes to 'idSite'
This is the program i wrote. Everything else seems to be fine.
import pandas as pd
import os
dirListing = os.listdir("D:/Python/Test/Diku/piwik/filteredcsv/")
df=[]
siteIds = [34]
for id in siteIds:
for item in dirListing:
if str(id) in item:
print item
df.append(pd.read_csv(item,sep = ",",dtype='unicode'))
df3 = pd.concat(df,axis=0, ignore_index=True)
df3.to_csv('merged_' + str(id) + '_raw'+'.csv', sep =',')
Can't seem to figure out the problem. Is it a encoding issue?

Categories