scala - inferSchema in spark-csv package
When a CSV is read into a DataFrame in Spark, all columns are read as strings. Is there a way to infer the actual type of each column?
I have the following CSV file:

name,department,years_of_experience,dob
sam,software,5,1990-10-10
alex,data analytics,3,1992-10-10
I've read the CSV using the code below:

val df = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(sampleAdDataS3Location)
df.schema
All the columns are read as strings. I expect the column years_of_experience to be read as int and dob to be read as date.
Please note that I have set the option inferSchema to true.
I am using the latest version (1.0.3) of the spark-csv package.
Am I missing something here?
2015-07-30

The latest version is actually 1.1.0, but it doesn't really matter since it looks like inferSchema is not included in the latest release.
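Until a release with working inference is available, one workaround is to convert the string values yourself after loading. The per-value conversions are plain Scala; here is a minimal sketch assuming rows shaped like the sample CSV above (Employee and parseRow are my own illustrative names, not part of spark-csv):

```scala
import java.time.LocalDate

// Typed record matching the sample CSV's columns (illustrative only).
case class Employee(name: String, department: String,
                    yearsOfExperience: Int, dob: LocalDate)

// Convert one already-split row of strings into typed fields.
// .toInt and LocalDate.parse both throw on malformed input,
// so production code would wrap them in Try or similar.
def parseRow(fields: Array[String]): Employee =
  Employee(fields(0), fields(1), fields(2).toInt, LocalDate.parse(fields(3)))

val sam = parseRow(Array("sam", "software", "5", "1990-10-10"))
// sam.yearsOfExperience is the Int 5 and sam.dob is a typed LocalDate
```

The same conversions can be mapped over an RDD of split lines to get typed data out of an untyped load.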
2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06), and schema inference works as expected:

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob: string (nullable = true)
Regarding automatic date parsing, I doubt it will ever happen, or at least not without providing additional metadata.
Even if all the fields follow some date-like format, it is impossible to tell if a given field should be interpreted as a date. So it is either a lack of automatic date inference or a spreadsheet-like mess. Not to mention issues with timezones, for example.
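To see why, note that this kind of inference amounts to a chain of trial parses per field, and a date branch is conspicuously absent: a value like 1990-10-10 could just as well be a product code. A rough sketch of the idea (my own simplified function, not the package's actual implementation):

```scala
import scala.util.Try

// Pick the narrowest type that can represent a field value by trial
// parsing, falling back to string (simplified illustration).
def inferFieldType(value: String): String =
  if (Try(value.toInt).isSuccess) "integer"
  else if (Try(value.toDouble).isSuccess) "double"
  else "string" // "1990-10-10" ends up here: no safe way to assume a date

println(inferFieldType("5"))          // integer
println(inferFieldType("1990-10-10")) // string
```

A real implementation also has to reconcile the types seen across all rows of a column, but the per-value ambiguity is the same.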
Finally, you can parse the date string manually:

sqlContext
  .sql("SELECT *, DATE(dob) AS dob_d FROM df")
  .drop("dob")
  .printSchema

root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob_d: date (nullable = true)

So it is not a serious issue.