How to think like a Data Scientist

Tags: data analysis
Posted by admin on 2017-04-06 11:50:18

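Data scientists need to be rigorous and stay alert to what is missing. The following suggestions are worth keeping in mind for improving day-to-day data science work:

1. Beware of the Clean Data Syndrome

Some questions need answering before you even start working with the data. **Does this data make sense?** Wrongly assuming that the data is clean leads to wrong inferences, and discrepancies in the data can themselves reveal important patterns. For example, if a particular column has more than 50% of its values missing, you might decide not to use that column, or suspect that the data collection method is faulty.

Or suppose a women's cosmetics company shows a male-to-female customer ratio of 90:10. You could assume the data is clean and report the result as is, or you could apply common sense and check whether the labels were switched.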
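
The two checks above can be sketched in a few lines of Python. This is only an illustration: the records, the column names, and the helper functions `missing_fraction`, `mostly_missing_columns`, and `label_shares` are all made up for this example, and the 50% threshold is simply the figure mentioned in the text.

```python
from collections import Counter

# Made-up customer records; None marks a missing value.
rows = [
    {"gender": "M", "age": 34, "referrer": None},
    {"gender": "M", "age": None, "referrer": None},
    {"gender": "F", "age": 29, "referrer": "ad"},
    {"gender": "M", "age": 41, "referrer": None},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing (None)."""
    return sum(r.get(column) is None for r in rows) / len(rows)

def mostly_missing_columns(rows, threshold=0.5):
    """Columns whose missing share exceeds `threshold`."""
    columns = rows[0].keys()
    return [c for c in columns if missing_fraction(rows, c) > threshold]

def label_shares(rows, column):
    """Share of each non-missing label in `column`."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

suspect_columns = mostly_missing_columns(rows)
gender_shares = label_shares(rows, "gender")
print(suspect_columns)  # columns to drop or investigate
print(gender_shares)    # compare against what the business would expect
```

Neither check decides anything on its own. A flagged column or a lopsided label split is a prompt to go back and ask how the data was collected.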

2. Manage Outliers wisely

Outliers can help you understand more about the people who are using your website/product 24 hours a day. But including them while building models will skew the models a lot.
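
One common way to set outliers aside before modelling, while still keeping them for separate study, is the 1.5 × IQR rule. The sketch below is an assumption about how one might do this, not a method the article prescribes; the usage numbers and the `split_outliers` helper are invented for illustration.

```python
import statistics

def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Made-up daily usage hours per user; one user is online around the clock.
hours = [0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 24.0]
inliers, outliers = split_outliers(hours)

print(outliers)                  # the users worth studying on their own
print(statistics.mean(hours))    # mean pulled up by the single outlier
print(statistics.mean(inliers))  # mean of the typical users
```

Here a single round-the-clock user more than doubles the mean, which is the kind of skew the text warns about.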

3. Keep an eye out for the Abnormal

Be on the lookout for anything out of the ordinary. If you find something, you may have struck gold.

For example, Flickr started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did they pivot.

Another example: Fab.com started out as Fabulis.com, a site to help gay men meet people. One of the site's popular features was the "Gay deal of the Day". One day the deal was for hamburgers, and half of the buyers were women. This made the team realize that there was a market for selling goods to women. So Fabulis pivoted to Fab, a flash-sale site for designer products.

4. Start Focussing on the right metrics

  • Beware of vanity metrics. For example, the number of active users by itself doesn't reveal much. I would rather say "5% MoM increase in active users" than "10000 active users". Even that is a vanity metric, since the active-user count tends to rise anyway as the user base grows. I would rather track the percentage of users who are active to know how my product is performing.
  • Try to find a metric that ties in with the business goal. For example, average sales per user for a particular month.
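
The three metrics mentioned above are simple ratios. The sketch below uses made-up monthly figures for a hypothetical product, and the function names are illustrative only.

```python
def pct_change(previous, current):
    """Month-over-month change, in percent."""
    return (current - previous) / previous * 100

def active_share(active_users, total_users):
    """Percentage of all registered users who were active."""
    return active_users / total_users * 100

def average_sales_per_user(total_sales, users):
    """Average sales per user for the period."""
    return total_sales / users

# Made-up monthly figures for a hypothetical product.
march = {"active": 10_000, "registered": 40_000, "sales": 50_000.0}
april = {"active": 10_500, "registered": 50_000, "sales": 52_500.0}

mom_growth = pct_change(march["active"], april["active"])
share_march = active_share(march["active"], march["registered"])
share_april = active_share(april["active"], april["registered"])
sales_per_user = average_sales_per_user(april["sales"], april["active"])

print(f"Active users MoM: {mom_growth:+.1f}%")
print(f"Active share: {share_march:.1f}% -> {share_april:.1f}%")
print(f"April sales per active user: {sales_per_user:.2f}")
```

In this invented example the active-user count grows by 5% while the share of users who are active falls from 25% to 21%. The first number looks like good news and the second does not.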

5. Statistics may lie too

Be critical of everything that gets quoted to you. Statistics have been used to mislead in advertisements, in workplaces, and in plenty of other marketing venues. People will do anything to get sales or promotions.

For example: do you believe Colgate's claim that 80% of dentists recommend their toothpaste?

This statistic seems pretty good at first. It turns out that at the time of surveying the dentists, they could choose several brands — not just one. So other brands could be just as popular as Colgate.

Another example: "99 percent accurate" doesn't mean shit. Ask me to create a cancer prediction model and I could give you a 99 percent accurate model in a single line of code. How? Just predict "No Cancer" for everyone. I will be right maybe more than 99% of the time, as cancer is a pretty rare disease. Yet I have achieved nothing.
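
The "predict No Cancer for everyone" model is easy to reproduce. The labels below are made up (10 cases among 1,000 patients, chosen only to give a 1% base rate), and `accuracy` and `recall` are small helpers written for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model caught."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in positives) / len(positives)

# Made-up labels: 1 = cancer, 0 = no cancer; 10 cases among 1,000 patients.
y_true = [1] * 10 + [0] * 990

# The "model": predict "No Cancer" for everyone.
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

The accuracy is 99%, yet the model catches none of the actual cases. On rare-event problems, a metric such as recall says far more than accuracy does.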

6. Understand how probability works

It happened during the summer of 1913 in a casino in Monaco. Gamblers watched in amazement as the casino's roulette wheel landed on black 26 times in a row. Since red and black are almost equally likely (the green zero aside), they were certain that red was "due". But each spin is independent of the ones before it, so the streak told them nothing about the next spin. It was a field day for the casino, and a perfect example of the gambler's fallacy, also known as the Monte Carlo fallacy.

This happens in real life too. People tend to avoid long strings of the same answer, sometimes sacrificing accuracy of judgment for a pattern of decisions that looks fairer or more probable.

For example, an admissions officer may reject the next application after approving three in a row, even if that application should have been accepted on merit.
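
A quick simulation makes the independence point concrete. This is a sketch under a simplifying assumption: red and black each come up with probability 0.5, whereas a real wheel also has a green zero. The function name and parameters are invented for this example.

```python
import random

def red_rate_after_black_streak(n_spins, streak, seed=0):
    """Simulate red/black spins and return how often red follows
    `streak` consecutive blacks."""
    rng = random.Random(seed)
    # Simplifying assumption: red and black each have probability 0.5
    # (a real wheel also has a green zero).
    spins = [rng.choice("RB") for _ in range(n_spins)]
    follow_ups = [
        spins[i]
        for i in range(streak, n_spins)
        if all(s == "B" for s in spins[i - streak:i])
    ]
    return follow_ups.count("R") / len(follow_ups)

rate = red_rate_after_black_streak(n_spins=200_000, streak=5)
print(f"Red after 5 blacks in a row: {rate:.3f}")
```

The printed rate should sit close to 0.5. A run of blacks does not make red any more likely on the next spin.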

7. Correlation Does Not Equal Causation

The Holy Grail of a data scientist's toolbox: to see something for what it is. Just because two variables move in tandem doesn't necessarily mean that one causes the other. There have been some hilarious examples of this in the past. Some of my favorites are:

  •  Looking at fire department data, you infer that the more firemen are sent to a fire, the more damage is done. In reality, the size of the fire drives both.
  •  When investigating the cause of crime in New York City in the 80s, an academic found a strong correlation between the amount of serious crime committed and the amount of ice cream sold by street vendors! Obviously, there was an unobserved variable causing both. Summers are when crime is at its highest and when the most ice cream is sold. So ice cream sales don't cause crime, nor does crime increase ice cream sales.
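
The ice cream example can be imitated with synthetic data. In the sketch below, a made-up `temperature` variable drives both series, and neither series influences the other. The coefficients, the noise levels, and the helpers `pearson` and `residuals` are assumptions chosen for illustration, not real crime or sales figures.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)
# Synthetic data: temperature drives both series; they never affect each other.
temperature = [rng.gauss(0, 1) for _ in range(5000)]
ice_cream = [2 * t + rng.gauss(0, 1) for t in temperature]
crime = [3 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream, crime)
controlled = pearson(residuals(ice_cream, temperature),
                     residuals(crime, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {controlled:.2f}")
```

The raw correlation is strong, but it should drop to roughly zero once the effect of temperature is removed from both series. In practice the hard part is that the hidden variable is often not in your data at all.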

8. More data may help

Sometimes getting extra data may work wonders. You might be able to model the real world more closely by looking at the problem from all angles. Look for extra data sources.

For example, crime data for a city might help banks offer a better credit line to a person living in a troubled neighborhood, and in turn improve the bottom line.
