Accessing Data from DataFrames

In my previous post, I discussed how to create pandas DataFrames in Python. In this post, we'll work on accessing data from DataFrames. First, let's recreate the student_data_frame from the last post.

from pandas import DataFrame

names = ['Akhilesh', 'Ruchi', 'Bhawna', 'Isha']
acc_marks = [97, 69, 19, 76]
eng_marks = [36, 85, 72, 68]
mat_marks = [47, 86, 41, 46]
eco_marks = [13, 51, 53, 11]
bus_marks = [34, 53, 40, 22]
# Each dictionary key becomes a column name; each list holds that column's values.
student_data_dict = {
    "Name": names,
    "Accountancy": acc_marks,
    "English": eng_marks,
    "Maths": mat_marks,
    "Economics": eco_marks,
    "Business Studies": bus_marks,
}
student_data_frame = DataFrame(student_data_dict)
print(student_data_frame)

Once we are done with the above code, the student_data_frame should look like this:

Now, let's talk about how we can access the data. We can select a specific column by treating its name as if it were a key in a dictionary. For example, if I wanted to retrieve the Name column of the DataFrame, I'd write:

student_data_frame['Name']

This would give me this output:

To get a DataFrame with several columns, I can pass a list of column names, as illustrated below:

column_list = ['Name', 'Maths', 'English']
student_data_frame[column_list]

And this would give the following output:
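To see the difference between the two kinds of selection without relying on printed output, here is a minimal sketch. It assumes a trimmed-down copy of the student data (three columns only) and simply checks what type each selection returns.

```python
from pandas import DataFrame, Series

# Same student data as above, trimmed to three columns to keep the sketch short.
student_data_frame = DataFrame({
    "Name": ["Akhilesh", "Ruchi", "Bhawna", "Isha"],
    "English": [36, 85, 72, 68],
    "Maths": [47, 86, 41, 46],
})

# A single column label returns a Series.
name_column = student_data_frame["Name"]
print(type(name_column).__name__)

# A list of labels returns a DataFrame, with columns in the order requested.
subset = student_data_frame[["Name", "Maths", "English"]]
print(type(subset).__name__)
print(list(subset.columns))
```

Note that the column order in the result follows the list you pass, not the order in the original DataFrame.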

I can also select specific rows by using the DataFrame's loc indexer and passing the row's index label in square brackets. For example, to get the first row of the DataFrame, we write:

student_data_frame.loc[0]
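It may help to see loc next to its positional counterpart, iloc. The sketch below is only an illustration: it uses a two-column copy of the data, and the by_name variable (the same data re-indexed by student name) is my own addition to show a case where labels and positions differ.

```python
from pandas import DataFrame

student_data_frame = DataFrame({
    "Name": ["Akhilesh", "Ruchi", "Bhawna", "Isha"],
    "Maths": [47, 86, 41, 46],
})

# With the default index (0, 1, 2, 3), labels and positions coincide,
# so loc[0] and iloc[0] return the same row.
first_by_label = student_data_frame.loc[0]
first_by_position = student_data_frame.iloc[0]
print(first_by_label.equals(first_by_position))

# Assumption for illustration: re-index by name so labels are no longer integers.
by_name = student_data_frame.set_index("Name")
print(by_name.loc["Ruchi", "Maths"])  # lookup by label
print(by_name.iloc[1]["Maths"])       # lookup by position, same row
```

In short, loc looks rows up by index label, while iloc looks them up by integer position.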

We can use boolean conditions to retrieve rows that meet certain criteria. For example, to get the students with a mark higher than 50 in maths, we could write:

student_data_frame.loc[student_data_frame.Maths>50]

This would give us the output:
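To unpack what is happening, the comparison itself produces a Series of True/False values (often called a mask), and loc keeps only the rows where the mask is True. Here is a small sketch on a two-column copy of the data; the names mask and high_scorers are mine.

```python
from pandas import DataFrame

student_data_frame = DataFrame({
    "Name": ["Akhilesh", "Ruchi", "Bhawna", "Isha"],
    "Maths": [47, 86, 41, 46],
})

# The comparison is evaluated for every row and yields one boolean per row.
mask = student_data_frame.Maths > 50
print(mask.tolist())  # [False, True, False, False]

# loc keeps only the rows where the mask is True.
high_scorers = student_data_frame.loc[mask]
print(high_scorers)
```

With this data, only Ruchi (86) scored more than 50 in maths, so the result has a single row.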

To make it more interesting, let's say we just want the name of anyone with more than 50 in maths. To do that, we'd write:

student_data_frame['Name'][student_data_frame.Maths>50]

What if we wanted the Name and the Maths result? You guessed it: we pass the Name and Maths columns as a list, as illustrated below:

wanted_columns = ['Name', 'Maths']
student_data_frame[wanted_columns][student_data_frame.Maths>50]
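As an alternative sketch (not the approach used in this post), loc also accepts the row condition and the column list together, separated by a comma. The example below uses a three-column copy of the data and checks that both versions return the same result.

```python
from pandas import DataFrame

student_data_frame = DataFrame({
    "Name": ["Akhilesh", "Ruchi", "Bhawna", "Isha"],
    "English": [36, 85, 72, 68],
    "Maths": [47, 86, 41, 46],
})

mask = student_data_frame.Maths > 50

# Rows first, columns second, all inside one pair of brackets.
result = student_data_frame.loc[mask, ["Name", "Maths"]]
print(result)

# The two-step version from the post selects the same data.
two_step = student_data_frame[["Name", "Maths"]][mask]
print(result.equals(two_step))
```

Doing the selection in one step keeps the row and column choices in the same place, which I find easier to read.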

Another way to select rows in DataFrames is to slice them. To slice a DataFrame, we use the following syntax:

student_data_frame[1:4]

This gives the output below. When slicing, the end index (4 in our case) is excluded, so we get the rows at positions 1, 2 and 3.
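One detail worth checking for yourself is how the end of a slice is treated. The sketch below, on a two-column copy of the data, compares the plain slice with iloc (which also excludes the end position) and loc (which slices by label and includes the end label).

```python
from pandas import DataFrame

student_data_frame = DataFrame({
    "Name": ["Akhilesh", "Ruchi", "Bhawna", "Isha"],
    "Maths": [47, 86, 41, 46],
})

# Plain slicing works on positions and excludes the end position.
plain_slice = student_data_frame[1:4]
print(plain_slice["Name"].tolist())

# iloc follows the same positional, end-exclusive rule.
iloc_slice = student_data_frame.iloc[1:4]

# loc slices by label and includes the end label, so 1:3 covers the same rows.
loc_slice = student_data_frame.loc[1:3]

print(plain_slice.equals(iloc_slice), plain_slice.equals(loc_slice))
```

All three return Ruchi, Bhawna and Isha here, but note that the loc slice needed 3 rather than 4 as its end.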

You may be a little confused here: in some of the examples I used loc, and in others I didn't, yet both work. That confusion also hit me when I started. The two syntaxes are yourDataFrame[boolean_condition] and yourDataFrame.loc[boolean_condition], and they return the same rows most of the time. However, it is advised to use .loc explicitly, especially where your data may contain Boolean columns. So the rule of thumb is: if you're not sure about the nature of your data, use yourDataFrame.loc[boolean_condition].
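One way to see why the explicit form is clearer: plain square brackets do different things depending on what you put inside them, whereas loc always treats its first argument as a row selector. The sketch below is only an illustration on a two-column copy of the data.

```python
from pandas import DataFrame

student_data_frame = DataFrame({
    "Name": ["Akhilesh", "Ruchi", "Bhawna", "Isha"],
    "Maths": [47, 86, 41, 46],
})

mask = student_data_frame.Maths > 50

# With a boolean mask, plain brackets and loc select the same rows.
rows_plain = student_data_frame[mask]
rows_loc = student_data_frame.loc[mask]
print(rows_plain.equals(rows_loc))

# With a single label, plain brackets look for a COLUMN, while loc looks for a ROW.
row_zero = student_data_frame.loc[0]
print(row_zero["Name"])
try:
    student_data_frame[0]
except KeyError:
    print("student_data_frame[0] looks for a column labelled 0 and fails")
```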

Summary

NB: You can think of a DataFrame as a group of Series that share an index. This makes it easy to select specific columns that you want from the DataFrame.

Also, a couple of pointers:
1) Selecting a single column from the DataFrame will return a Series
2) Selecting multiple columns from the DataFrame will return a DataFrame

Row selection can be done in several ways. Some of the basic and common methods of accessing rows in DataFrames are:
1) Slicing
2) An individual index (through the indexers iloc or loc)
3) Boolean indexing

Summary taken from here
