I have a Python script, let's call it MergeData.py, in which I merge two data files. Since I have a lot of pairs of data files that have to be merged, I thought it would be good for readability to put the code in MergeData.py into a function, say merge_data(), and call this function in a loop over all my pairs of data files from a different Python script.
Two questions:
Is it wise, in terms of speed, to call the function from a different file instead of running the code directly in the loop? (I have thousands of pairs that have to be merged.)
I thought that to use the function in MergeData.py I have to include from MergeData import merge_data at the head of my script. Within the function merge_data I make use of pandas, which I import in the main file with 'import pandas as pd'. When calling the function I get the error 'NameError: global name 'pd' is not defined'. I have tried all possible places to import the pandas module, even within the function, but the error keeps popping up. What am I doing wrong?
In MergeData.py I have
def merge_data(myFile1, myFile2):
    df1 = pd.read_csv(myFile1)
    df2 = pd.read_csv(myFile2)
    # ... my code
and in the other file I have
import pandas as pd
from MergeData import merge_data

# then some code to get my file names, followed by
FileList = zip(FileList1, FileList2)
for myFile1, myFile2 in FileList:
    # Run merging algorithm
    dataEq = merge_data(myFile1, myFile2)
I am aware of What is the best way to call a Python script from another Python script?, but cannot really see if that relates to me.
You need to move the line
import pandas as pd
into the module in which the symbol pd is actually used, i.e. move it out of your "other file" and into your MergeData.py file. Each module has its own namespace, so imports in the calling script are not visible inside the imported module.
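A minimal sketch of the corrected MergeData.py. The file names come from the question; the merge column "key" is a placeholder for whatever column the real data is joined on:

```python
# MergeData.py
import pandas as pd  # the import lives in the module that uses pd

def merge_data(myFile1, myFile2):
    df1 = pd.read_csv(myFile1)
    df2 = pd.read_csv(myFile2)
    # "key" is a placeholder for the column the real data is joined on
    return df1.merge(df2, on="key")
```

With the import inside the module, the calling script only needs from MergeData import merge_data and no longer has to import pandas itself.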
I am new to data science, so after a few lessons on importing data in Python, I tried the following code in my Jupyter notebook but keep getting an error saying df is not defined. I need help.
The code I wrote is as follows;
import pandas as pd
url = "https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv"
df = pd.read_csv(https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv)
After running the third line, I got a series of errors in the Jupyter notebook, but the one that stood out was "df not defined".
The problem here is that your data is a ZIP file containing multiple CSV files. You need to download the data, unpack the ZIP file, and then read one CSV file at a time. Also note that in the line as written the URL is not quoted, so Python raises a SyntaxError before df is ever assigned, which is why the later cell reports that df is not defined.
If you can give more details on the problem (e.g. screenshots), debugging will become easier.
One possibility for the error is that the content served at the URL (https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv) is a ZIP file, which may prevent pandas from processing it directly.
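A minimal sketch of the unpack-then-read pattern. To keep it self-contained, it builds a small ZIP in memory instead of downloading the World Bank file; for the real data you would first fetch the bytes (e.g. with urllib.request.urlopen), and the member names inside the archive would differ:

```python
import io
import zipfile
import pandas as pd

# Stand-in for the downloaded bytes: a ZIP containing one CSV member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("indicator.csv", "country,value\nA,1\nB,2\n")

# Unpack the archive and read one CSV member at a time.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    csv_names = [n for n in zf.namelist() if n.endswith(".csv")]
    with zf.open(csv_names[0]) as f:
        df = pd.read_csv(f)
```

The same pattern applies to the World Bank download: list the archive members, pick the CSV you want, and pass the open file handle to pd.read_csv.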
My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own csv file so '2009.csv', '2010.csv', all the way to '2018.csv'
Each file is roughly 700mb
I used the following to combine csv files
import pandas as pd
import numpy as np
import os, sys
import glob

os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.
I am presuming that my files are too large and that I will need to run this on a virtual machine (e.g. AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all the data into a DataFrame at the same time. They all have the same structure, so I would advise creating one file writer, then opening each file with a file reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the headers each time, since you only want them once. Pandas is simply not the best tool for this task :)
In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact command depends on the OS).
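A minimal sketch of the streaming approach described above, assuming all files share the same header (the '2009.csv' ... '2018.csv' naming comes from the question):

```python
import glob

def combine_csvs(pattern, out_path):
    """Stream every CSV matching `pattern` into one output file,
    keeping only the first file's header line."""
    wrote_header = False
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, "r", encoding="utf-8") as src:
                header = src.readline()
                if not wrote_header:       # copy the header once
                    out.write(header)
                    wrote_header = True
                for line in src:           # stream line by line,
                    out.write(line)        # no DataFrame in memory

# combine_csvs("*.csv", "combined_airline.csv")
```

Because only one line is held in memory at a time, this avoids the MemoryError regardless of how large the yearly files are.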
I have a one-column Excel file. I want to import all the values it has into a variable x (something like x = [1, 2, 3, 4.5, -6, ...]), then use this variable to run numpy.correlate(x, x, mode='full') to get the autocorrelation, after importing numpy.
When I manually enter x = [1, 2, 3, ...], it does the job fine, but when I try to copy-paste all the values into x = [], it gives me a NameError: name 'NO' is not defined.
Can someone tell me how to go around doing this?
You can use pandas to import a CSV file with the pd.read_csv function.
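A small sketch of the whole pipeline, assuming the Excel sheet has been exported as a headerless one-column CSV (header=None is the assumption to drop if the file does have a header row):

```python
import numpy as np
import pandas as pd

def autocorrelate_column(path):
    # Read the single column into a plain Python list ...
    x = pd.read_csv(path, header=None)[0].tolist()
    # ... and compute the full autocorrelation, as in the question.
    return np.correlate(x, x, mode="full")
```

This replaces the error-prone copy-paste step: the values never pass through the interpreter as source code, so stray cell contents like "NO" cannot trigger a NameError.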
I am trying to read a pandas dataframe and perform certain operations and return the dataframe. I also want to multiprocess the operation to take advantage of multiple cores that my system has.
import time

import pandas as pd
import re
from jug import TaskGenerator

@TaskGenerator
def find_replace(input_path_find):
    start_time = time.clock()
    df_find = pd.read_csv(input_path_find)
    df_find.currentTitle = df_find.currentTitle.str.replace(r"[^a-zA-Z0-9`~!|##%&_}={:\"\];<>,./. -]", r'')
    # extra space
    df_find.currentTitle = df_find.currentTitle.str.replace(r'\s+', ' ')
    # length
    df_find['currentTitle_sort'] = df_find.currentTitle.str.len()
    # sort
    df_find = df_find.sort_values(by='currentTitle_sort', ascending=0)
    # reindex
    df_find.reset_index(drop=True, inplace=True)
    del df_find['currentTitle_sort']
    return df_find
When I pass the parameter, which is the CSV file I want to process,
df_returned = find_replace('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv')
I am getting some weird output
find_replace
Task(__main__.find_replace, args=('C:\\Users\\Dell\\Downloads\\Find_Replace_in_this_v1.csv',), kwargs={})
In [ ]:
Any help? I basically want to save the output from the function
I have already checked this answer and it didn't work: Pandas memoization. Also, I am using Python 2.7 and the Anaconda IDE.
This is a misunderstanding of how jug works.
The result you are getting is indeed a Task object, which you can run with df_returned.run().
Usually, though, you would have saved this script to a file (say analysis.py) and called jug execute analysis.py to execute the tasks.
The code below generates a sum from the "Value" column of a CSV file called 'File1.csv'.
How do I apply this code to every file in a directory and place the sums in a new file called Sum.csv?
import pandas as pd
import numpy as np
df = pd.read_csv("~/File1.csv")
df["Value"].sum()
Many thanks!
There's probably a nice way to do this with a pandas Panel, but this is a basic python implementation.
import os
import pandas as pd

# Get the home directory (not recommended, work somewhere else)
directory = os.environ["HOME"]

# Read all files in directory, filter out non-csv
files = [os.path.join(directory, f)
         for f in os.listdir(directory) if f.endswith(".csv")]

# Make list of tuples [(filename, sum)]
sums = [(filename, pd.read_csv(filename)["Value"].sum())
        for filename in files]

# Make a dataframe
df = pd.DataFrame(sums, columns=["filename", "sum"])
df.to_csv(os.path.join(directory, "files_with_sum.csv"))
Note that the built-in Python os.listdir() doesn't understand "~/" like pandas does, so we get the home directory out of the environment map. Using the home directory isn't really recommended, so this gives any adopter of this code an opportunity to set a different path.
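As an aside, a portable alternative to reading HOME from the environment is os.path.expanduser, which resolves the "~/" prefix the same way pandas does (a small sketch, not part of the answer above):

```python
import os

# Expand "~/" to an absolute home-directory path, so os.listdir()
# can work with the same shorthand that pandas accepts.
directory = os.path.expanduser("~/")
csv_path = os.path.join(directory, "File1.csv")
```

This keeps the "~/File1.csv" convention from the question while still giving os.listdir() a path it understands.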