Details about Cleaning Data for Effective Data Science
There are roughly two families of problems we find in datasets. Not every problem neatly divides into these families, or at least it is not always evident which side something falls on without knowing the root cause. But in a general way, we can think of structural problems in the formatting of data versus content problems in the actual values recorded. On the structural branch, a format used to encode a dataset might simply “put values in the wrong place” in one way or another. On the content side, the data format itself is correct, but implausible or wrong values have snuck in via flawed instruments, transcription errors, numeric overflows, or through other pitfalls of the recording process.
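As a rough illustration of the two families, the sketch below sorts the rows of a tiny CSV into structural problems, content problems, and clean rows. Everything in it is an assumption made for the example and is not taken from the book: the sample data, the `classify_rows` helper, and the plausible temperature range are all hypothetical. It uses only Python's standard `csv` module.

```python
import csv
import io

# Hypothetical sensor log, invented for illustration only.
# Row B2 has an unquoted comma, so a value lands in the wrong column.
# Row C3 is well-formed, but the reading is implausible.
RAW = """station,city,temp_c
A1,Oslo,4.5
B2,Washington, DC,21.0
C3,Lima,9999
D4,Cairo,30.2
"""

# Assumed plausible range for air temperature in Celsius.
PLAUSIBLE_TEMP_C = (-90.0, 60.0)


def classify_rows(text):
    """Split data rows into structural problems, content problems, and clean rows."""
    structural, content, clean = [], [], []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    low, high = PLAUSIBLE_TEMP_C
    for row in reader:
        # Structural: the row does not have the shape the header promises.
        if len(row) != len(header):
            structural.append(row)
            continue
        # Content: the shape is right, but the value is unusable or implausible.
        try:
            temp = float(row[2])
        except ValueError:
            content.append(row)
            continue
        if low <= temp <= high:
            clean.append(row)
        else:
            content.append(row)
    return structural, content, clean


structural, content, clean = classify_rows(RAW)
print("structural:", structural)
print("content:", content)
print("clean:", clean)
```

Note that the split is not always this tidy. The extra field in row B2 is detectable only because it changes the field count; a misplaced value that kept the count intact would look like a content problem until the root cause was known.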
The several early chapters that discuss “data ingestion” are much more focused on structural problems in data sources, and less on numeric or content problems. It is not always cleanly possible to separate these issues, but as a question of emphasis it makes sense for the ingestion chapters to look at structural matters, and for later chapters on anomalies, data quality, feature engineering, value imputation, and model-based cleaning to direct attention to content issues.