
Csv Schema Inference

A tool to automatically infer column data types in .csv files

Check the article here: Building a Schema Inference Data Pipeline for Large CSV files

## **Installing csv-schema-inference** 🔧

```shell
pip install csv-schema-inference
```

    Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
    Collecting csv-schema-inference
      Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
    Installing collected packages: csv-schema-inference
    Successfully installed csv-schema-inference-0.0.9
## **Importing csv-schema-inference library** ⚡

```python
from csv_schema_inference import csv_schema_inference
```
## **Setting csv-schema-inference configuration** ✍

```python
# If the inferred data type is INTEGER but FLOAT values are also present
# in the results, the final result will be FLOAT.
conditions = {"INTEGER": "FLOAT"}

csv_infer = csv_schema_inference.CsvSchemaInference(
    portion=0.9, max_length=100, batch_size=200000, acc=0.8,
    seed=2, header=True, sep=",", conditions=conditions,
)

pathfile = "/content/file__500k.csv"
```
## **Run inference** 🏃

```python
aprox_schema = csv_infer.run_inference(pathfile)
```
## **Showing the approximate data type inference for each column** 🔍
```python
csv_infer.pretty(aprox_schema)
```

    0   name id           type INTEGER  nullable False
    1   name full_name    type STRING   nullable True
    2   name age          type INTEGER  nullable False
    3   name city         type STRING   nullable True
    4   name weight       type FLOAT    nullable False
    5   name height       type FLOAT    nullable False
    6   name isActive     type BOOLEAN  nullable False
    7   name col_int1     type INTEGER  nullable False
    8   name col_int2     type INTEGER  nullable False
    9   name col_int3     type INTEGER  nullable False
    10  name col_float1   type FLOAT    nullable False
    11  name col_float2   type FLOAT    nullable False
    12  name col_float3   type FLOAT    nullable False
    13  name col_float4   type FLOAT    nullable False
    14  name col_float5   type FLOAT    nullable False
    15  name col_float6   type FLOAT    nullable False
    16  name col_float7   type FLOAT    nullable False
    17  name col_float8   type FLOAT    nullable False
    18  name col_float9   type FLOAT    nullable False
    19  name col_float10  type FLOAT    nullable False
    20  name test_column  type FLOAT    nullable False
## **Checking schema values for specific columns** ✔

```python
result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)
```

    20
        _name test_column
        types_found
            INTEGER cnt 406130
            FLOAT cnt 50964
        nullable False
        type FLOAT
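
The output for `test_column` shows the `conditions` rule from the configuration at work: INTEGER is the most frequent type, but because FLOAT values were also found, the final type is FLOAT. The following is a minimal sketch of such a promotion rule, not the library's actual code; `resolve_type` is a hypothetical helper introduced only for illustration.

```python
def resolve_type(type_counts, conditions):
    """Pick a column type from per-type counts, then apply promotion rules.

    Hypothetical helper: not part of csv-schema-inference.
    """
    # Start from the most frequent type found in the column.
    inferred = max(type_counts, key=type_counts.get)
    # Promote only when a rule exists and the target type was also observed.
    target = conditions.get(inferred)
    if target is not None and target in type_counts:
        return target
    return inferred


# Counts taken from the test_column output above.
counts = {"INTEGER": 406130, "FLOAT": 50964}
print(resolve_type(counts, {"INTEGER": "FLOAT"}))  # FLOAT
print(resolve_type(counts, {}))                    # INTEGER
```

Without the rule, the column would be typed by its majority (INTEGER), and the float values would no longer fit the declared type.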
## **Explore all possible data types for a specific column** ✅

```python
result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)
```

    20
        name test_column
        types_found
            INTEGER 88.85043339006856
            FLOAT 11.149566609931437
        nullable False
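
These percentages appear to be each type's share of the values counted for the column: 406130 INTEGER and 50964 FLOAT values out of 457094. That relationship is inferred from the numbers shown here, not confirmed from the library's source. A short sketch of the arithmetic, with `type_percentages` as a hypothetical helper:

```python
def type_percentages(type_counts):
    """Convert per-type counts into percentages of the total (illustrative helper)."""
    total = sum(type_counts.values())
    return {name: 100 * count / total for name, count in type_counts.items()}


# Counts reported for test_column by get_schema_columns.
pcts = type_percentages({"INTEGER": 406130, "FLOAT": 50964})
for name, pct in pcts.items():
    print(f"{name}: {pct:.2f}%")
# INTEGER: 88.85%
# FLOAT: 11.15%
```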

Benchmark

The tests used 9 .csv files with 21 columns and varying sizes and record counts. For each file, the shuffle time and the inference time were each averaged over 5 executions.
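
A minimal sketch of how such an average could be measured, assuming wall-clock timing with `time.perf_counter`. The `average_runtime` helper and the stand-in workload are hypothetical and are not the code used for the benchmark.

```python
import time


def average_runtime(func, runs=5):
    """Return the mean wall-clock time of `func` over `runs` executions."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def fake_inference():
    # Stand-in workload; replace with the real shuffle or inference call.
    sum(i * i for i in range(100_000))


print(f"mean over 5 runs: {average_runtime(fake_inference):.4f} s")
```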

To learn more about the shuffling process, see this other repository: A tool to automatically Shuffle lines in .csv files. The shuffling process helps us to:

  1. Increase the probability of finding all the data types present in a single column.
  2. Avoid iterating over the entire dataset.
  3. Avoid biases in the data that may be part of its organic behavior and that we cannot anticipate without knowing how the data was constructed.
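
The idea behind these points can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the pipeline used by this tool: `infer_value_type` and `sample_column_types` are hypothetical helpers, the type rules are deliberately small, and the whole input is loaded into memory, which a real pipeline for large files would avoid.

```python
import csv
import io
import random
from collections import Counter


def infer_value_type(value):
    """Classify one CSV cell; a deliberately small rule set for illustration."""
    if value == "":
        return "NULL"
    if value.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "STRING"


def sample_column_types(csv_text, column, portion=0.5, seed=2):
    """Count the types seen in a random portion of the rows of one column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rng.shuffle(rows)          # break up any ordering present in the file
    sample = rows[: max(1, int(len(rows) * portion))]
    return Counter(infer_value_type(row[column]) for row in sample)


# Sorted input: the only float is on the last row, so reading just the
# first half in file order would never see it.
data = "value\n" + "\n".join(["1", "2", "3", "4", "5", "6", "7", "8.5"])
print(sample_column_types(data, "value", portion=0.5))
```

Because the rows are shuffled before sampling, the float on the last row has the same chance of being sampled as any other row, whereas an unshuffled read of the top half would report the column as purely INTEGER.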

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

License

This project is licensed under the terms of the MIT License.