


Spark SQL and DataFrames: Introduction to Built-in Data Sources

In the previous chapter, we explained the evolution of and justification for structure in Spark. In particular, we discussed how the Spark SQL engine provides a unified foundation for the high-level DataFrame and Dataset APIs. Now, we’ll continue our discussion of the DataFrame and explore its interoperability with Spark SQL. This chapter and the next also explore how Spark SQL interfaces with some of the external components shown in Figure 4-1.

In particular, Spark SQL:

Provides the engine upon which the high-level Structured APIs we explored in Chapter 3 are built.
Can read and write data in a variety of structured formats (e.g., JSON, Hive tables, Parquet, Avro, ORC, CSV); a short sketch follows this list.
Lets you query data using JDBC/ODBC connectors from external business intelligence (BI) data sources such as Tableau, Power BI, and Talend, or from RDBMSs such as MySQL and PostgreSQL.
Provides a programmatic interface to interact with structured data stored as tables or views in a database from a Spark application.
Offers an interactive shell to issue SQL queries on your structured data.
Supports ANSI SQL:2003-compliant commands and HiveQL.
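As a quick, self-contained sketch of the second point above (reading and writing structured formats), here is what the DataFrameReader/DataFrameWriter pattern looks like in Python. The input path, output path, and application name are hypothetical placeholders, not part of this chapter’s data set:

# In Python -- a minimal sketch; the paths and app name below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FormatsSketch").getOrCreate()

# Read a Parquet file into a DataFrame (the schema is stored in the Parquet metadata).
parquet_df = spark.read.format("parquet").load("/tmp/example_input.parquet")

# Write the same data back out as JSON.
(parquet_df.write
  .format("json")
  .mode("overwrite")
  .save("/tmp/example_output_json"))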
In this section we’ll walk through a few examples of queries on the Airline On-Time Performance and Causes of Flight Delays data set, which contains data on US flights including date, delay, distance, origin, and destination. It’s available as a CSV file with over a million records. Using a schema, we’ll read the data into a DataFrame and register the DataFrame as a temporary view (more on temporary views shortly) so we can query it with SQL. These examples will offer you a taste of how to use SQL in your Spark applications via the spark.sql programmatic interface.
Query examples are provided in code snippets, and Python and Scala notebooks containing all of the code presented here are available in the book’s GitHub repo.

Similar to the DataFrame API in its declarative flavor, the spark.sql interface allows you to query structured data in your Spark applications.

Normally, in a standalone Spark application, you will create a SparkSession instance manually, as shown in the following example. However, in a Spark shell (or Databricks notebook), the SparkSession is created for you and accessible via the appropriately named variable spark.

Let’s get started by reading the data set into a temporary view:

// In Scala
import org.apache.spark.sql.SparkSession

// Create a SparkSession (the application name is arbitrary)
val spark = SparkSession
  .builder
  .appName("SparkSQLExampleApp")
  .getOrCreate()

// Path to data set
val csvFile = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

// Read and create a temporary view
// Infer schema (note that for larger files you may want to specify the schema)
val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(csvFile)

// Create a temporary view
df.createOrReplaceTempView("us_delay_flights_tbl")

# In Python
from pyspark.sql import SparkSession

# Create a SparkSession (the application name is arbitrary)
spark = (SparkSession
  .builder
  .appName("SparkSQLExampleApp")
  .getOrCreate())

# Path to data set
csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

# Read and create a temporary view
# Infer schema (note that for larger files you
# may want to specify the schema)
df = (spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(csv_file))

# Create a temporary view
df.createOrReplaceTempView("us_delay_flights_tbl")
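With the view registered, you can issue SQL against it through spark.sql, which returns a DataFrame you can keep transforming. The snippet below is a minimal sketch that continues from the Python setup above. It also shows one way to supply an explicit schema up front instead of inferring it; the DDL column types are assumptions based on the fields described earlier (date, delay, distance, origin, destination), and the rows returned depend on the data set:

# In Python -- a minimal sketch, continuing from the Python setup above.
# The DDL column types below are assumptions about the CSV, used for illustration only.
schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"

df = (spark.read.format("csv")
  .schema(schema)
  .option("header", "true")
  .load(csv_file))
df.createOrReplaceTempView("us_delay_flights_tbl")

# spark.sql returns a DataFrame, so the usual DataFrame operations still apply
long_flights = spark.sql("""
    SELECT distance, origin, destination
    FROM us_delay_flights_tbl
    WHERE distance > 1000
    ORDER BY distance DESC
""")
long_flights.show(10)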
