Importing CSV in Python and manipulating the data [closed]

I'm new to programming and I need to do some (maybe very basic) stuff, but I'm really struggling with it.
I have some CSV files; when opened in Excel, each consists of roughly 1500 rows and 500 columns, and it's all numbers except for the first element of the first row (some kind of header). I need to do things like averaging over the elements of the first 60 rows and adding and subtracting complete rows.
I'm having a bit of trouble importing the files. When I just use the csv reader and add the rows one by one to an empty dataset, I get the desired format (a list of rows?), but all the elements are strings instead of floats (maybe because the first element in the file is a string?) and I can't get them to convert to floats, so maybe you can help me out a little.
Another thing: how do I actually manipulate a certain part of the data, such as with a loop going through a certain number of rows? I can't really figure it out, since mathematical operations on strings don't work.
Thanks in advance for your help and comments!

I use the following and it works fine:
import numpy
csv = numpy.loadtxt('something.csv', delimiter=',')
If you want to skip the first row, you can do it like this:
csv = numpy.loadtxt('something.csv', delimiter=',', skiprows=1)
And if you want to operate on the first 60 rows:
X = csv[:60, :]
Then you just use X for what you want.
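For the averaging the question asks about, here is a minimal sketch building on X (assuming skiprows was used, so everything left is numeric):

# Column-wise average of the first 60 rows
first60_means = X.mean(axis=0)

# Adding and subtracting complete rows works element-wise
row_sum = csv[0, :] + csv[1, :]
row_diff = csv[0, :] - csv[1, :]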
Hope it helps

What you need is read_csv from pandas, which gives you a dataframe.
The following code will automatically recognize the header row and use it for the column names.
import pandas as pd
data = pd.read_csv('Your file name.csv')
Regarding your problem of the data coming in as strings, there is no way to help you without some sample data.
I need to do stuff like averaging over the elements of the first 60 rows and adding and subtracting complete rows.
For averaging the first 60 rows, you can do something like this:
import pandas as pd
lst1 = range(100)
lst2 = range(100,200)
lst3 = range(200,300)
data = pd.DataFrame({'a': lst1,'b': lst2,'c': lst3})
data_avrg = data[:60].mean()
In [20]: data_avrg
Out[20]:
a     29.5
b    129.5
c    229.5
dtype: float64
If you want to add or subtract the average of 60 rows to the complete rows, like all rows in column a, you can do this:
data['a_add'] = data.a + data_avrg.a
data['a_subtract'] = data.a - data_avrg.a
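Since pandas broadcasts a Series of means across the columns of a frame by label, the same idea extends to all columns at once; a small sketch with the data above:

# data_avrg's index (a, b, c) aligns with data's columns
data_add = data + data_avrg
data_subtract = data - data_avrg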

The first cell being a string doesn't mean the whole column is of string type; that cell is probably the label of the column. Try accessing the data from the 2nd row onwards, or explicitly name the columns,
for example:
df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
print(df)
output:
   $a  $b
0   1  10
1   2  20
You can change the column names with
df.columns = ['a', 'b']
output:
   a   b
0  1  10
1  2  20
and after changing the names you can access the columns as df['a'] or df['b'].
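If, as in the question, the values still come out as strings after naming the columns, one option (a sketch, not specific to the asker's file) is pd.to_numeric, which converts whole columns and turns anything unparseable into NaN:

# Coerce every column to numeric; bad cells become NaN instead of raising
df = df.apply(pd.to_numeric, errors='coerce')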

Related

How do I take data from one csv file and find the results in another using python? [closed]

I have two csv files with data in both of them. The first one has columns of data like the following (screenshot omitted; columns A, B, C and D of numbers).
The second csv file (screenshot omitted; a Number column and a Details column) looks similar.
What I want is to be able to group these together based on the values in column D. I want to create a new column next to D that matches its value to the Details entry in the second file. The output would look something like the table at the end of the answer below (screenshot omitted).
File number 1 is always changing numbers, and there are hundreds of numbers in each file, so I don't want to write them out manually in a list in the code.
Use Pandas:
import pandas as pd
df1 = pd.read_csv('first_file.csv')
df2 = pd.read_csv('second_file.csv')
df1['Details'] = df1['D'].map(df2.set_index('Number')['Details'])
# or
df1 = df1.merge(df2, left_on='D', right_on='Number', how='left')
df1.to_csv('merge_file.csv', index=False)
Update: the same version with the csv module (the parenthesized with statement needs Python 3.10+):
import csv

with (open('first_file.csv') as fp1,
      open('second_file.csv') as fp2,
      open('merge_file.csv', 'w', newline='') as out):
    df1 = csv.DictReader(fp1)
    df2 = csv.DictReader(fp2)
    mapping = {row['Number']: row['Details'] for row in df2}
    dfm = csv.DictWriter(out, df1.fieldnames + ['Details'])
    dfm.writeheader()
    for row in df1:
        dfm.writerow(dict(row, Details=mapping.get(row['D'], '')))
Output:
A    B    C   D   Details
15   22   13  41  comma
52   36   67  87  carrer
91   150  41  12  recording
123  14   76  16  mold

How to create python function that performs multiple checks on a dataframe?

I have multiple inventory tables like so:
line no  -1 qty   -2 qty
1        -        3
2        42.1 FT  -
3        5        -
4        -        10 FT
5        2        1
6        6.7      -

or

line no  qty
1        2
2        4.5 KG
3        5
4
5        13
6        AR
I want to create a logic check for the quantity columns using Python. (A table may have more than one qty column and I need to be able to check all of them. In both examples, the tables are already formatted as dataframes.)
Acceptable criteria:
- integer with or without "EA" (meaning each)
- "AR" (as required)
- integer or float with a unit of measure
- if there are multiple qty columns, then "-" is also accepted (first table)
I want to return a list per page containing the line no. of each row where the quantity value is missing (line 4, second table) or does not meet the acceptance criteria (line 6, first table). If every line passes the checks, then return True.
I have tried:
qty_col = [col for col in df.columns if 'qty' in col]
df['corr_qty'] = np.where(qty_col.isnull(), False, df['line_no'])
but this creates the quantity columns as a list and yields the following
AttributeError: 'list' object has no attribute 'isnull'
Intro and Suggestions:
Welcome to StackOverflow. A general tip when asking questions on S.O.: include as much information as possible. In addition, always identify the libraries you want to use and the accepted approach, since there can be multiple solutions to the same problem; it looks like you've done that.
Also, it is best to share all, or at least most, of your attempted solutions so others can understand your thought process and provide the best potential solution.
The Solution:
It wasn't clear whether the solution you are looking for requires reading the PDF to create the dataframe, or whether converting the PDF to a CSV and processing the CSV is sufficient. I took the latter approach.
import tabula as tb
import pandas as pd

# PDF file path
input_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.pdf"
# CSV file path
output_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.csv"

# Alternative: read the PDF directly into dataframes (not used here)
# data = tb.read_pdf(input_file_path, pages='all')

# Convert the PDF to CSV
tb.convert_into(input_file_path, output_file_path, "csv", pages="all")

# Read the initial data (avoid 'id' as a name; it shadows the builtin)
initial_data = pd.read_csv(output_file_path, delimiter=",")
# Print the initial data
print(initial_data)

# Create the dataframe, keeping only the qty column
df = pd.DataFrame(initial_data, columns=['qty'])
# Print boolean values: True where a quantity is present
print(df.notna())
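The snippet above only flags missing values via notna(); the acceptance criteria still need their own check. Below is a minimal sketch using regular expressions. The pattern, the 'line no' column name, and the rule that quantity column names contain 'qty' are all assumptions based on the tables in the question:

import re

# Assumed pattern for the stated criteria: an integer (optionally "EA"),
# "AR", or an integer/float followed by a unit of measure
VALID = re.compile(r'^(\d+( EA)?|AR|\d+(\.\d+)? [A-Z]+)$')

def check_quantities(df):
    qty_cols = [c for c in df.columns if 'qty' in c]
    multi_qty = len(qty_cols) > 1  # "-" is only acceptable with multiple qty columns
    failed = []
    for _, row in df.iterrows():
        values = [str(row[c]).strip() for c in qty_cols]
        ok = all(VALID.match(v) or (multi_qty and v == '-') for v in values)
        if not ok:
            failed.append(row['line no'])  # assumed column name from the tables
    return True if not failed else failed

On the sample tables this would flag line 6 in the first (6.7 is neither an integer nor a number with a unit) and line 4 in the second (missing value), matching the expected output.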

Could not drop NaN values using Pandas [closed]

I am trying to drop NaN values using the dropna() method provided by pandas. I've read the documentation and looked at other StackOverflow posts, but I still could not fix the error.
In my code, I first read an Excel file. Cells with the value "-" are turned into NaN values. After that, I use dropna() to drop the NaN values and reassign the result to a new variable called mydf2. Below are my code and a screenshot.
mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx',
                     na_values='-')
mydf = mydf.set_index(['Variables'])
print(mydf.head(5))  # Original data
mydf2 = mydf.dropna()
print(mydf2)
dropna() has worked correctly. You have two print statements. The first one has printed five rows as asked for by print(mydf.head(5)).
The output of your second print statement, print(mydf2), is an empty dataframe [0 rows x 37 columns] because apparently every row contains at least one NaN (see the bottom of your screenshot).
It sounds like the "NaN" here is actually the string "-", so do:
mydf2 = mydf.replace('-', np.nan).dropna()  # requires import numpy as np
I wrote a piece of code here; it works fine with my data, so try this out.
mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx')
to_del = []
for i in range(mydf.shape[0]):
    if "-" in list(mydf.iloc[i]):
        to_del.append(i)
out_df = mydf.drop(to_del, axis=0)
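An equivalent vectorized version (a sketch with the same behavior, assuming "-" is the only placeholder value):

# Keep only the rows that contain no "-" in any column
out_df = mydf[~mydf.isin(["-"]).any(axis=1)]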
As you have not posted your data, I'm not sure whether every row has NaN values or not. If so, df.dropna() will simply drop every row. For example, the columns 1981 and 1982 are all NaN values in your image; using df.dropna(axis=1) will drop these two columns instead and will not return an empty df.
df = pd.DataFrame({'Variables': ['Total', 'Single', 'Married', 'Widowed', 'Divorced/Separated'],
                   '1980': range(5),
                   '1981': [np.nan] * 5})
df = df.set_index('Variables')
df.dropna(axis=1)

How to sort a csv file by column

I need to sort a .csv file in a very specific way, but I have pretty limited knowledge of Python. I have some code that works, but it doesn't really do exactly what I want it to do. The format is as follows:
{header} {header} {header} {header}
{dataA} {dataB} {dataC} {dataD}
In the csv, whatever dataA is, it is usually repeated 100-200 times. Is there a way I can take dataA (e.g. examplecompany), find out how many times it repeats, and then how many times each dataC value appears with that dataA as the first item in the row? For example, the output might be: examplecompany appeared 100 times; out of those 100, datac1 appeared 45 times and datac2 appeared 55 times. I'm really terrible at explaining things; any help would be appreciated.
You can use csv.DictReader to read the file and then sort by the key you want.
from csv import DictReader

with open("test.csv") as f:
    reader = DictReader(f)
    sorted_rows = sorted(reader, key=lambda x: x["column1"])
CSV file I tested it with (test.csv):
column1,column2
2,bla
1,blubb
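If the goal is a sorted file rather than sorted rows in memory, here is a sketch that writes the result back out (sorted.csv is an assumed output name):

import csv

with open("test.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    # String sort; use key=lambda x: int(x["column1"]) for numeric order
    rows = sorted(reader, key=lambda x: x["column1"])

with open("sorted.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)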
It is not clear what you want to accomplish, since you have not provided any code or a complete example of input/output for your problem.
It seems that you want to count occurrences of the data in headerC for each unique value in headerA.
Suppose you have the following .csv file:
headerA,headerB,headerC,headerD
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany2,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac3,datad
You can accomplish this counting with pandas. The following is an example of how you might do it.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df.groupby(['headerA'])['headerC'].value_counts()
headerA          headerC
examplecompany1  datac1     3
                 datac2     2
                 datac3     1
examplecompany2  datac2     2
                 datac1     1
Name: headerC, dtype: int64
Here, groupby will group the DataFrame using headerA as a reference. You can group by a single Series or a list of Series. After that, the square bracket notation is used to access the headerC column and value_counts will count each occurrence of headerC that was previously grouped by headerA. Afterwards you can just format the output for what you want.
Edit:
I forgot that you also wanted the number of occurrences of headerA, but that is really simple: you can get it directly by selecting the headerA column of the DataFrame df and calling value_counts on it.
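For example, with the same test.csv:

>>> df['headerA'].value_counts()
examplecompany1    6
examplecompany2    3
Name: headerA, dtype: int64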

Force Pandas to keep multiple columns with the same name

I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x', 'x', 'x']  # one name per column, repeated as needed
df.to_excel(path, header=newColNames)
You can add spaces to the end of the column names. They will appear the same in Excel, but pandas can distinguish them.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=['x', 'x ', 'x  '])  # one and two trailing spaces
df

   x  x   x
0  1  2   3
1  4  5   6
2  7  8   9
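Going the other way, if you later read the sheet back in and want the padded names collapsed again, a one-line sketch:

# Strip the trailing spaces; pandas tolerates the resulting duplicate labels
df.columns = df.columns.str.strip()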
