Python/pandas: Column value in list
I’ve been using Python’s pandas library while exploring some CSV files and although for the most part I’ve found it intuitive to use, I had trouble filtering a data frame based on checking whether a column value was in a list.
A subset of one of the CSV files I’ve been working with looks like this:
01 02 03 04 05 06 07 08 09 10 11 12 | $ cat foo.csv 'Foo' 1 2 3 4 5 6 7 8 9 10 |
Loading it into a pandas data frame is reasonably simple:
01 02 03 04 05 06 07 08 09 10 11 12 13 14 | import pandas as pd df = pd.read_csv( 'foo.csv' , index_col = False , header = 0 ) >>> df Foo 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 |
If we want to find the rows which have a value of 1 we’d write the following:
1 2 3 | >>> df[df[ 'Foo' ] = = 1 ] Foo 0 1 |
Finding the rows with a value less than 7 is as you’d expect too:
1 2 3 4 5 6 7 8 | >>> df[df[ 'Foo' ] < 7 ] Foo 0 1 1 2 2 3 3 4 4 5 5 6 |
Next I wanted to filter out the rows containing odd numbers which I initially tried to do like this:
01 02 03 04 05 06 07 08 09 10 | odds = [i for i in range ( 1 , 10 ) if i % 2 <> 0 ] >>> odds [ 1 , 3 , 5 , 7 , 9 ] >>> df[df[ 'Foo' ] in odds] Traceback (most recent call last): File '<stdin>' , line 1 , in <module> File '/Users/markneedham/projects/neo4j-himym/himym/lib/python2.7/site-packages/pandas/core/generic.py' , line 698 , in __nonzero__ . format ( self .__class__.__name__)) ValueError: The truth value of a Series is ambiguous. Use a.empty, a. bool (), a.item(), a. any () or a. all (). |
Unfortunately that doesn’t work and I couldn’t get any of the suggestions from the error message to work either. Luckily pandas has a special isin function for this use case which we can call like this:
1 2 3 4 5 6 7 | >>> df[df[ 'Foo' ].isin(odds)] Foo 0 1 2 3 4 5 6 7 8 9 |
Much better!
Reference: | Python/pandas: Column value in list from our WCG partner Mark Needham at the Mark Needham Blog blog. |